blob: c2a949169123d291822c1698977cef7a139b0974 [file] [log] [blame]
<p></p>
<h1>XML Content Search in Eclipse Help </h1>
<p>Konrad Kolosowski and Dejan Glozic,
<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%m/%d/%Y" startspan -->08/08/2005<!--webbot bot="Timestamp" endspan i-checksum="12630" --></p>
<h2>&nbsp;</h2>
<h2>Background</h2>
<p>Although Eclipse help requires HTML as the final presentation format, there
are no technical reason why documentation cannot be delivered in XML form and
transformed into HTML on demand. This approach provides for conditional
transformation that can produce different results depending on the context
attributes such as platform, target audience, context etc. The visual attributes
can also be controlled outside the content.</p>
<p>The purpose of this document is to address the problem of information search
in Eclipse help system when XML is used as a content format. Search indexing
will have to be performed on the original XML documents, while the search hits
obtained from the subsequent searches will point at the transformed HTML
documents. This document analyses problems that result from this separation and
offers possible solutions.</p>
<p>This document only lists ideas and is by no means an indication of the final
implementation. For this reason, please do not make dependencies or otherwise
factor in these ideas into your product plans.</p>
<p>In the following text, the acronym 'UA' will be used repeatedly to denote
'User Assistance'.</p>
<h2>The problem</h2>
<p>Eclipse 3.1 help system uses one instance of Lucene index per locale for
indexing all documentation. Multiple fields are used for title, summary, content
text analyzed in two ways (stemmed and exact search). Name field is used to
relate document to document identifiers understood by the help system - URLs.
More than one instance of an index exists only when different locale is used. In
the Infocenter installation, multiple indexes are in use simultaneously, but are
completely independent. The only two exceptions are that multiple locales are
synchronized to prevent memory spikes and locking shared configuration to
prevent index corruption by multiple Infocenters.</p>
<p>When indexing help content contributed by XML, keeping 3.1 search features
would be fairly easy. The only change required would be to delegate to a
different parser or document producer when encountering an XML document instead
of an HTML one. An enhancement would be that we will allow other UA components
(e.g. intro, cheatsheets) to be indexed, and displayable documents built from
XML, rather than have one XML file to one topic relationship.</p>
<p>The following thoughts assume we will allow XML as a format for Eclipse help,
but not completely unify format of content contribution across UA components.
Content and API will be reusable, thus opening possibility for search beyond
classic help system topic, across UA.</p>
<h3>The Lucene index structure</h3>
<p>Indexing information from components other than help system to the index can
be designed using multiple indexes, one for information from each component, or
single index for all information. Since the information is of similar type and
will likely be searched together, the results must be presented as a unified set
and it makes more sense to index all types of documentation using one index for
all components.</p>
<p>Since Lucene will serve multiple components, not just help system, it needs
to be separated from the tight control of the help system. The life cycle of
IndexWriter, IndexReader, IndexSearcher and stability of the index need to be
part of the Index component - Lucene plus a manager layer with APIs. Access to
the index should be through service like APIs, with participants registered
through extension point. This will allow index to manage concurrency and solicit
indexing of all components when search call is initiated, change in
contributions occurs or recovery from corruption is needed. </p>
<p>Lucene has no assumption of document format. A search result is a document
artifact that consists of any number of text fields containing text tokens.
Basic search query allows for searching for one term within one field. Searching
across fields is accomplished using complex queries built with knowledge of
relevant fields. We have flexibility of using additional fields for storing or
meta information, in text format. There is no built-in relationship between
fields, no structure, and no relationship between documents, making indexing of
structured information not straight forward if preserving structure or
relationship is important. </p>
<p>We must keep the number of searchable fields finite and low, to preserve good
performance. There are two natural approaches for indexing an XML document that
use single field that will later be searched. One is to extract all text from
the document and index in one field (another field needed to store the document
identifier). This is very simple, performing solution, but treats all text from
the document equal. Another approach is to treat each element text and each
element attribute values as units, and index every one as a dedicated document.
For this position to be beneficial, it requires an element context (its
containment within document and other element) to be stored, or an identifier
assigned that can be used to look up the corresponding XML document outside of
Lucene. </p>
<h3>Abstracting the indexable document</h3>
<p>Forgetting for a while about format of the source document, what makes sense
in Eclipse UA is to be able to retrieve document fragments that can be presented
on its own or are reusable pieces that can be displayed as part of different
pages or different places. The fact that the content originated from an XML file
should be irrelevant. Therefore XML content needs to be abstracted into a
fragment that will correspond to Lucene document that can be indexed. A sample
abstract IndexableDocument can be designed with the following methods: </p>
<p>String getContributor - containing id of index contributor, for examples
org.eclipse.ui.intro String getName() - containing unique document identifier
understood by a component, for example URL, XPath, or anything that will be
needed to retrieve document for displaying Reader getContent() - containing main
text content for search String getTitle() - containing higher importance text
for search String getKeywords() - containing optional keywords, synonyms for
search, present or not in the original document String getRawTitle() -
containing one line human readable name/title that can be later retrieved from
the index String getSummary() - containing optional, multiple line human
readable description of the content String getConstraints() - containing ids of
the constraints, like OS, or roles that this content applies to </p>
<p>More methods can be added if necessary, but allowing variable number of
fields based on XML schema would take away benefits of using common index
between components. Fields private to component depending on each XML definition
would practically equal having separate indexes and require different search
query for each document type, while theoretically all documents would be kept in
the same Lucene index/file. The benefits and performance of such design are
questionable. </p>
<p>Each index participant will register their own IndexableDocument factory
(parser, text extractor, digester whatever the name) with the index manager.
Upon requests for the document the factory will produce the document that can
easily be indexed. Assuming we will not be able to unify contribution format
across all UA components (old help, intro, and cheat sheets format may need to
be supported), each UA component may have their own and multiple of factories or
even allow for pluggable factories for undefined source document format. A
factory may exist for DITA XML, XHMTL, HTML, PDF or dynamically generated
content. From the implementation point of view, factories will mainly be
parsers. If there is a need the parsers can be parameterized and work of the
schema that is also used for displaying documents, but the challenge is to make
them fast, and withstand long documents without large memory consumption. </p>
<h3>Collating and filtering results</h3>
<p>If separate, each UA component needs to keep track of its own contribution,
be able to answer whether it requires indexing (either addition or deletion) of
documents, and produce a list of document IDs to add/remove upon indexing
triggered by any of components. This will allow for consistent index and
searches irrespective of component activation by the user.</p>
<p>Merging of search results from across components occurs automatically, as all
documents exist in the same Lucene index. However search results will need to be
passed through each indexing participant for filtering and converting of
IndexableDocument identifier to a handle to a document that can be presented to
the user. Each component would be responsible for filtering search results that
should be hidden from the user. For example enabling activity is a frequent
operation that should not result in an index change. Filtering based on
activities should occur post search, on search results. </p>
<p>Some filtering constraints with well defined set of values, for example OS,
WS, ARCH can be indexed with minimal overhead. If indexing occurred on the
client machine, documents not satisfying the constraints could be skipped and
not indexed at all, but taking advantage of prebuilt indexes requires such
content to be indexed and filtered on the client. Keeping such content
identified by indexed constraints will allow automatic filtering by
complimenting the search with additional boolean query for constraints field.
</p>
<p>Filtering based on NL should not occur. Given that UA content is human
readable it is almost 100% translatable and very small number of documents is
expected to be common between languages. For this reason index should contain
text for one locale only as in 3.1 help system.</p>
<p>Optionally, we can add additional filtering field for the UA component's own
use. The information there would not be further analyzed, but indexed as is.
When performing search, each UA component would have a chance of narrowing
search to its own contributions, and further by providing query to apply to
additional filtering field. For example, when invoked from a wizard, help system
search may provide a query to mandate additional filtering field of the results
to contain “task”. The information can be partitioned this way into subsets not
necessary understood by all parts of UA. </p>
<p>One of the things expected in future versions of help are dynamically
revealed links to related documents, when they are present, without showing
broken links all the time. Not trying to support searches for references, it
does not look necessary to index links or their description. Omitting links from
indexing will ensure that the target document is found when it exists, and not
the document that refers to it.</p>
<h3>Handling prebuilt indexes</h3>
<p>Since index will be shared among components, producing prebuilt indexes will
be more challenging. For easiest maintenance and consistency, it is best for the
Index component to be responsible for generating and merging prebuilt indexes.
It will require that each component will handle and document factories are able
to work with documents in the workspace, or other installation of Eclipse.</p>
<p>It is impossible to come up with a perfect ranking algorithm. The more
structural differences between documents, the more complicated index is and more
difficult to design optimal ranking algorithm. Help system search up to 3.1
suffers from document length having too great an influence on the ranking.
Shorter documents with N matches rank much higher than long documents with the
same N number of matches in text. It may be necessary to implement custom Scorer
for UA searches.</p>