| <p></p> |
| <h1>XML Content Search in Eclipse Help </h1> |
| <p>Konrad Kolosowski and Dejan Glozic, |
| <!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%m/%d/%Y" startspan -->08/08/2005<!--webbot bot="Timestamp" endspan i-checksum="12630" --></p> |
| <h2> </h2> |
| <h2>Background</h2> |
| <p>Although Eclipse help requires HTML as the final presentation format, there |
| are no technical reason why documentation cannot be delivered in XML form and |
| transformed into HTML on demand. This approach provides for conditional |
| transformation that can produce different results depending on the context |
| attributes such as platform, target audience, context etc. The visual attributes |
| can also be controlled outside the content.</p> |
| <p>The purpose of this document is to address the problem of information search |
| in Eclipse help system when XML is used as a content format. Search indexing |
| will have to be performed on the original XML documents, while the search hits |
| obtained from the subsequent searches will point at the transformed HTML |
| documents. This document analyses problems that result from this separation and |
| offers possible solutions.</p> |
| <p>This document only lists ideas and is by no means an indication of the final |
| implementation. For this reason, please do not make dependencies or otherwise |
| factor in these ideas into your product plans.</p> |
| <p>In the following text, the acronym 'UA' will be used repeatedly to denote |
| 'User Assistance'.</p> |
| <h2>The problem</h2> |
| <p>Eclipse 3.1 help system uses one instance of Lucene index per locale for |
| indexing all documentation. Multiple fields are used for title, summary, content |
| text analyzed in two ways (stemmed and exact search). Name field is used to |
| relate document to document identifiers understood by the help system - URLs. |
| More than one instance of an index exists only when different locale is used. In |
| the Infocenter installation, multiple indexes are in use simultaneously, but are |
| completely independent. The only two exceptions are that multiple locales are |
| synchronized to prevent memory spikes and locking shared configuration to |
| prevent index corruption by multiple Infocenters.</p> |
| <p>When indexing help content contributed by XML, keeping 3.1 search features |
| would be fairly easy. The only change required would be to delegate to a |
| different parser or document producer when encountering an XML document instead |
| of an HTML one. An enhancement would be that we will allow other UA components |
| (e.g. intro, cheatsheets) to be indexed, and displayable documents built from |
| XML, rather than have one XML file to one topic relationship.</p> |
| <p>The following thoughts assume we will allow XML as a format for Eclipse help, |
| but not completely unify format of content contribution across UA components. |
| Content and API will be reusable, thus opening possibility for search beyond |
| classic help system topic, across UA.</p> |
| <h3>The Lucene index structure</h3> |
| <p>Indexing information from components other than help system to the index can |
| be designed using multiple indexes, one for information from each component, or |
| single index for all information. Since the information is of similar type and |
| will likely be searched together, the results must be presented as a unified set |
| and it makes more sense to index all types of documentation using one index for |
| all components.</p> |
| <p>Since Lucene will serve multiple components, not just help system, it needs |
| to be separated from the tight control of the help system. The life cycle of |
| IndexWriter, IndexReader, IndexSearcher and stability of the index need to be |
| part of the Index component - Lucene plus a manager layer with APIs. Access to |
| the index should be through service like APIs, with participants registered |
| through extension point. This will allow index to manage concurrency and solicit |
| indexing of all components when search call is initiated, change in |
| contributions occurs or recovery from corruption is needed. </p> |
| <p>Lucene has no assumption of document format. A search result is a document |
| artifact that consists of any number of text fields containing text tokens. |
| Basic search query allows for searching for one term within one field. Searching |
| across fields is accomplished using complex queries built with knowledge of |
| relevant fields. We have flexibility of using additional fields for storing or |
| meta information, in text format. There is no built-in relationship between |
| fields, no structure, and no relationship between documents, making indexing of |
| structured information not straight forward if preserving structure or |
| relationship is important. </p> |
| <p>We must keep the number of searchable fields finite and low, to preserve good |
| performance. There are two natural approaches for indexing an XML document that |
| use single field that will later be searched. One is to extract all text from |
| the document and index in one field (another field needed to store the document |
| identifier). This is very simple, performing solution, but treats all text from |
| the document equal. Another approach is to treat each element text and each |
| element attribute values as units, and index every one as a dedicated document. |
| For this position to be beneficial, it requires an element context (its |
| containment within document and other element) to be stored, or an identifier |
| assigned that can be used to look up the corresponding XML document outside of |
| Lucene. </p> |
| <h3>Abstracting the indexable document</h3> |
| <p>Forgetting for a while about format of the source document, what makes sense |
| in Eclipse UA is to be able to retrieve document fragments that can be presented |
| on its own or are reusable pieces that can be displayed as part of different |
| pages or different places. The fact that the content originated from an XML file |
| should be irrelevant. Therefore XML content needs to be abstracted into a |
| fragment that will correspond to Lucene document that can be indexed. A sample |
| abstract IndexableDocument can be designed with the following methods: </p> |
| <p>String getContributor - containing id of index contributor, for examples |
| org.eclipse.ui.intro String getName() - containing unique document identifier |
| understood by a component, for example URL, XPath, or anything that will be |
| needed to retrieve document for displaying Reader getContent() - containing main |
| text content for search String getTitle() - containing higher importance text |
| for search String getKeywords() - containing optional keywords, synonyms for |
| search, present or not in the original document String getRawTitle() - |
| containing one line human readable name/title that can be later retrieved from |
| the index String getSummary() - containing optional, multiple line human |
| readable description of the content String getConstraints() - containing ids of |
| the constraints, like OS, or roles that this content applies to </p> |
| <p>More methods can be added if necessary, but allowing variable number of |
| fields based on XML schema would take away benefits of using common index |
| between components. Fields private to component depending on each XML definition |
| would practically equal having separate indexes and require different search |
| query for each document type, while theoretically all documents would be kept in |
| the same Lucene index/file. The benefits and performance of such design are |
| questionable. </p> |
| <p>Each index participant will register their own IndexableDocument factory |
| (parser, text extractor, digester whatever the name) with the index manager. |
| Upon requests for the document the factory will produce the document that can |
| easily be indexed. Assuming we will not be able to unify contribution format |
| across all UA components (old help, intro, and cheat sheets format may need to |
| be supported), each UA component may have their own and multiple of factories or |
| even allow for pluggable factories for undefined source document format. A |
| factory may exist for DITA XML, XHMTL, HTML, PDF or dynamically generated |
| content. From the implementation point of view, factories will mainly be |
| parsers. If there is a need the parsers can be parameterized and work of the |
| schema that is also used for displaying documents, but the challenge is to make |
| them fast, and withstand long documents without large memory consumption. </p> |
| <h3>Collating and filtering results</h3> |
| <p>If separate, each UA component needs to keep track of its own contribution, |
| be able to answer whether it requires indexing (either addition or deletion) of |
| documents, and produce a list of document IDs to add/remove upon indexing |
| triggered by any of components. This will allow for consistent index and |
| searches irrespective of component activation by the user.</p> |
| <p>Merging of search results from across components occurs automatically, as all |
| documents exist in the same Lucene index. However search results will need to be |
| passed through each indexing participant for filtering and converting of |
| IndexableDocument identifier to a handle to a document that can be presented to |
| the user. Each component would be responsible for filtering search results that |
| should be hidden from the user. For example enabling activity is a frequent |
| operation that should not result in an index change. Filtering based on |
| activities should occur post search, on search results. </p> |
| <p>Some filtering constraints with well defined set of values, for example OS, |
| WS, ARCH can be indexed with minimal overhead. If indexing occurred on the |
| client machine, documents not satisfying the constraints could be skipped and |
| not indexed at all, but taking advantage of prebuilt indexes requires such |
| content to be indexed and filtered on the client. Keeping such content |
| identified by indexed constraints will allow automatic filtering by |
| complimenting the search with additional boolean query for constraints field. |
| </p> |
| <p>Filtering based on NL should not occur. Given that UA content is human |
| readable it is almost 100% translatable and very small number of documents is |
| expected to be common between languages. For this reason index should contain |
| text for one locale only as in 3.1 help system.</p> |
| <p>Optionally, we can add additional filtering field for the UA component's own |
| use. The information there would not be further analyzed, but indexed as is. |
| When performing search, each UA component would have a chance of narrowing |
| search to its own contributions, and further by providing query to apply to |
| additional filtering field. For example, when invoked from a wizard, help system |
| search may provide a query to mandate additional filtering field of the results |
| to contain task. The information can be partitioned this way into subsets not |
| necessary understood by all parts of UA. </p> |
| <p>One of the things expected in future versions of help are dynamically |
| revealed links to related documents, when they are present, without showing |
| broken links all the time. Not trying to support searches for references, it |
| does not look necessary to index links or their description. Omitting links from |
| indexing will ensure that the target document is found when it exists, and not |
| the document that refers to it.</p> |
| <h3>Handling prebuilt indexes</h3> |
| <p>Since index will be shared among components, producing prebuilt indexes will |
| be more challenging. For easiest maintenance and consistency, it is best for the |
| Index component to be responsible for generating and merging prebuilt indexes. |
| It will require that each component will handle and document factories are able |
| to work with documents in the workspace, or other installation of Eclipse.</p> |
| <p>It is impossible to come up with a perfect ranking algorithm. The more |
| structural differences between documents, the more complicated index is and more |
| difficult to design optimal ranking algorithm. Help system search up to 3.1 |
| suffers from document length having too great an influence on the ranking. |
| Shorter documents with N matches rank much higher than long documents with the |
| same N number of matches in text. It may be necessary to implement custom Scorer |
| for UA searches.</p> |