platform-ua/proposals/xmlsearch/xmlsearch.htm - www.eclipse.org/eclipse - Git at Google

 <p></p>
 <h1>XML Content Search in Eclipse Help </h1>
 <p>Konrad Kolosowski and Dejan Glozic,
 <!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%m/%d/%Y" startspan -->08/08/2005<!--webbot bot="Timestamp" endspan i-checksum="12630" --></p>
 <h2>&nbsp;</h2>
 <h2>Background</h2>
 <p>Although Eclipse help requires HTML as the final presentation format, there
 are no technical reason why documentation cannot be delivered in XML form and
 transformed into HTML on demand. This approach provides for conditional
 transformation that can produce different results depending on the context
 attributes such as platform, target audience, context etc. The visual attributes
 can also be controlled outside the content.</p>
 <p>The purpose of this document is to address the problem of information search
 in Eclipse help system when XML is used as a content format. Search indexing
 will have to be performed on the original XML documents, while the search hits
 obtained from the subsequent searches will point at the transformed HTML
 documents. This document analyses problems that result from this separation and
 offers possible solutions.</p>
 <p>This document only lists ideas and is by no means an indication of the final
 implementation. For this reason, please do not make dependencies or otherwise
 factor in these ideas into your product plans.</p>
 <p>In the following text, the acronym 'UA' will be used repeatedly to denote
 'User Assistance'.</p>
 <h2>The problem</h2>
 <p>Eclipse 3.1 help system uses one instance of Lucene index per locale for
 indexing all documentation. Multiple fields are used for title, summary, content
 text analyzed in two ways (stemmed and exact search). Name field is used to
 relate document to document identifiers understood by the help system - URLs.
 More than one instance of an index exists only when different locale is used. In
 the Infocenter installation, multiple indexes are in use simultaneously, but are
 completely independent. The only two exceptions are that multiple locales are
 synchronized to prevent memory spikes and locking shared configuration to
 prevent index corruption by multiple Infocenters.</p>
 <p>When indexing help content contributed by XML, keeping 3.1 search features
 would be fairly easy. The only change required would be to delegate to a
 different parser or document producer when encountering an XML document instead
 of an HTML one. An enhancement would be that we will allow other UA components
 (e.g. intro, cheatsheets) to be indexed, and displayable documents built from
 XML, rather than have one XML file to one topic relationship.</p>
 <p>The following thoughts assume we will allow XML as a format for Eclipse help,
 but not completely unify format of content contribution across UA components.
 Content and API will be reusable, thus opening possibility for search beyond
 classic help system topic, across UA.</p>
 <h3>The Lucene index structure</h3>
 <p>Indexing information from components other than help system to the index can
 be designed using multiple indexes, one for information from each component, or
 single index for all information. Since the information is of similar type and
 will likely be searched together, the results must be presented as a unified set
 and it makes more sense to index all types of documentation using one index for
 all components.</p>
 <p>Since Lucene will serve multiple components, not just help system, it needs
 to be separated from the tight control of the help system. The life cycle of
 IndexWriter, IndexReader, IndexSearcher and stability of the index need to be
 part of the Index component - Lucene plus a manager layer with APIs. Access to
 the index should be through service like APIs, with participants registered
 through extension point. This will allow index to manage concurrency and solicit
 indexing of all components when search call is initiated, change in
 contributions occurs or recovery from corruption is needed. </p>
 <p>Lucene has no assumption of document format. A search result is a document
 artifact that consists of any number of text fields containing text tokens.
 Basic search query allows for searching for one term within one field. Searching
 across fields is accomplished using complex queries built with knowledge of
 relevant fields. We have flexibility of using additional fields for storing or
 meta information, in text format. There is no built-in relationship between
 fields, no structure, and no relationship between documents, making indexing of
 structured information not straight forward if preserving structure or
 relationship is important. </p>
 <p>We must keep the number of searchable fields finite and low, to preserve good
 performance. There are two natural approaches for indexing an XML document that
 use single field that will later be searched. One is to extract all text from
 the document and index in one field (another field needed to store the document
 identifier). This is very simple, performing solution, but treats all text from
 the document equal. Another approach is to treat each element text and each
 element attribute values as units, and index every one as a dedicated document.
 For this position to be beneficial, it requires an element context (its
 containment within document and other element) to be stored, or an identifier
 assigned that can be used to look up the corresponding XML document outside of
 Lucene. </p>
 <h3>Abstracting the indexable document</h3>
 <p>Forgetting for a while about format of the source document, what makes sense
 in Eclipse UA is to be able to retrieve document fragments that can be presented
 on its own or are reusable pieces that can be displayed as part of different
 pages or different places. The fact that the content originated from an XML file
 should be irrelevant. Therefore XML content needs to be abstracted into a
 fragment that will correspond to Lucene document that can be indexed. A sample
 abstract IndexableDocument can be designed with the following methods: </p>
 <p>String getContributor - containing id of index contributor, for examples
 org.eclipse.ui.intro String getName() - containing unique document identifier
 understood by a component, for example URL, XPath, or anything that will be
 needed to retrieve document for displaying Reader getContent() - containing main
 text content for search String getTitle() - containing higher importance text
 for search String getKeywords() - containing optional keywords, synonyms for
 search, present or not in the original document String getRawTitle() -
 containing one line human readable name/title that can be later retrieved from
 the index String getSummary() - containing optional, multiple line human
 readable description of the content String getConstraints() - containing ids of
 the constraints, like OS, or roles that this content applies to </p>
 <p>More methods can be added if necessary, but allowing variable number of
 fields based on XML schema would take away benefits of using common index
 between components. Fields private to component depending on each XML definition
 would practically equal having separate indexes and require different search
 query for each document type, while theoretically all documents would be kept in
 the same Lucene index/file. The benefits and performance of such design are
 questionable. </p>
 <p>Each index participant will register their own IndexableDocument factory
 (parser, text extractor, digester whatever the name) with the index manager.
 Upon requests for the document the factory will produce the document that can
 easily be indexed. Assuming we will not be able to unify contribution format
 across all UA components (old help, intro, and cheat sheets format may need to
 be supported), each UA component may have their own and multiple of factories or
 even allow for pluggable factories for undefined source document format. A
 factory may exist for DITA XML, XHMTL, HTML, PDF or dynamically generated
 content. From the implementation point of view, factories will mainly be
 parsers. If there is a need the parsers can be parameterized and work of the
 schema that is also used for displaying documents, but the challenge is to make
 them fast, and withstand long documents without large memory consumption. </p>
 <h3>Collating and filtering results</h3>
 <p>If separate, each UA component needs to keep track of its own contribution,
 be able to answer whether it requires indexing (either addition or deletion) of
 documents, and produce a list of document IDs to add/remove upon indexing
 triggered by any of components. This will allow for consistent index and
 searches irrespective of component activation by the user.</p>
 <p>Merging of search results from across components occurs automatically, as all
 documents exist in the same Lucene index. However search results will need to be
 passed through each indexing participant for filtering and converting of
 IndexableDocument identifier to a handle to a document that can be presented to
 the user. Each component would be responsible for filtering search results that
 should be hidden from the user. For example enabling activity is a frequent
 operation that should not result in an index change. Filtering based on
 activities should occur post search, on search results. </p>
 <p>Some filtering constraints with well defined set of values, for example OS,
 WS, ARCH can be indexed with minimal overhead. If indexing occurred on the
 client machine, documents not satisfying the constraints could be skipped and
 not indexed at all, but taking advantage of prebuilt indexes requires such
 content to be indexed and filtered on the client. Keeping such content
 identified by indexed constraints will allow automatic filtering by
 complimenting the search with additional boolean query for constraints field.
 </p>
 <p>Filtering based on NL should not occur. Given that UA content is human
 readable it is almost 100% translatable and very small number of documents is
 expected to be common between languages. For this reason index should contain
 text for one locale only as in 3.1 help system.</p>
 <p>Optionally, we can add additional filtering field for the UA component's own
 use. The information there would not be further analyzed, but indexed as is.
 When performing search, each UA component would have a chance of narrowing
 search to its own contributions, and further by providing query to apply to
 additional filtering field. For example, when invoked from a wizard, help system
 search may provide a query to mandate additional filtering field of the results
 to contain task. The information can be partitioned this way into subsets not
 necessary understood by all parts of UA. </p>
 <p>One of the things expected in future versions of help are dynamically
 revealed links to related documents, when they are present, without showing
 broken links all the time. Not trying to support searches for references, it
 does not look necessary to index links or their description. Omitting links from
 indexing will ensure that the target document is found when it exists, and not
 the document that refers to it.</p>
 <h3>Handling prebuilt indexes</h3>
 <p>Since index will be shared among components, producing prebuilt indexes will
 be more challenging. For easiest maintenance and consistency, it is best for the
 Index component to be responsible for generating and merging prebuilt indexes.
 It will require that each component will handle and document factories are able
 to work with documents in the workspace, or other installation of Eclipse.</p>
 <p>It is impossible to come up with a perfect ranking algorithm. The more
 structural differences between documents, the more complicated index is and more
 difficult to design optimal ranking algorithm. Help system search up to 3.1
 suffers from document length having too great an influence on the ranking.
 Shorter documents with N matches rank much higher than long documents with the
 same N number of matches in text. It may be necessary to implement custom Scorer
 for UA searches.</p>
	<p></p>
	<h1>XML Content Search in Eclipse Help </h1>
	<p>Konrad Kolosowski and Dejan Glozic,
	<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%m/%d/%Y" startspan -->08/08/2005<!--webbot bot="Timestamp" endspan i-checksum="12630" --></p>
	<h2> </h2>
	<h2>Background</h2>
	<p>Although Eclipse help requires HTML as the final presentation format, there
	are no technical reason why documentation cannot be delivered in XML form and
	transformed into HTML on demand. This approach provides for conditional
	transformation that can produce different results depending on the context
	attributes such as platform, target audience, context etc. The visual attributes
	can also be controlled outside the content.</p>
	<p>The purpose of this document is to address the problem of information search
	in Eclipse help system when XML is used as a content format. Search indexing
	will have to be performed on the original XML documents, while the search hits
	obtained from the subsequent searches will point at the transformed HTML
	documents. This document analyses problems that result from this separation and
	offers possible solutions.</p>
	<p>This document only lists ideas and is by no means an indication of the final
	implementation. For this reason, please do not make dependencies or otherwise
	factor in these ideas into your product plans.</p>
	<p>In the following text, the acronym 'UA' will be used repeatedly to denote
	'User Assistance'.</p>
	<h2>The problem</h2>
	<p>Eclipse 3.1 help system uses one instance of Lucene index per locale for
	indexing all documentation. Multiple fields are used for title, summary, content
	text analyzed in two ways (stemmed and exact search). Name field is used to
	relate document to document identifiers understood by the help system - URLs.
	More than one instance of an index exists only when different locale is used. In
	the Infocenter installation, multiple indexes are in use simultaneously, but are
	completely independent. The only two exceptions are that multiple locales are
	synchronized to prevent memory spikes and locking shared configuration to
	prevent index corruption by multiple Infocenters.</p>
	<p>When indexing help content contributed by XML, keeping 3.1 search features
	would be fairly easy. The only change required would be to delegate to a
	different parser or document producer when encountering an XML document instead
	of an HTML one. An enhancement would be that we will allow other UA components
	(e.g. intro, cheatsheets) to be indexed, and displayable documents built from
	XML, rather than have one XML file to one topic relationship.</p>
	<p>The following thoughts assume we will allow XML as a format for Eclipse help,
	but not completely unify format of content contribution across UA components.
	Content and API will be reusable, thus opening possibility for search beyond
	classic help system topic, across UA.</p>
	<h3>The Lucene index structure</h3>
	<p>Indexing information from components other than help system to the index can
	be designed using multiple indexes, one for information from each component, or
	single index for all information. Since the information is of similar type and
	will likely be searched together, the results must be presented as a unified set
	and it makes more sense to index all types of documentation using one index for
	all components.</p>
	<p>Since Lucene will serve multiple components, not just help system, it needs
	to be separated from the tight control of the help system. The life cycle of
	IndexWriter, IndexReader, IndexSearcher and stability of the index need to be
	part of the Index component - Lucene plus a manager layer with APIs. Access to
	the index should be through service like APIs, with participants registered
	through extension point. This will allow index to manage concurrency and solicit
	indexing of all components when search call is initiated, change in
	contributions occurs or recovery from corruption is needed. </p>
	<p>Lucene has no assumption of document format. A search result is a document
	artifact that consists of any number of text fields containing text tokens.
	Basic search query allows for searching for one term within one field. Searching
	across fields is accomplished using complex queries built with knowledge of
	relevant fields. We have flexibility of using additional fields for storing or
	meta information, in text format. There is no built-in relationship between
	fields, no structure, and no relationship between documents, making indexing of
	structured information not straight forward if preserving structure or
	relationship is important. </p>
	<p>We must keep the number of searchable fields finite and low, to preserve good
	performance. There are two natural approaches for indexing an XML document that
	use single field that will later be searched. One is to extract all text from
	the document and index in one field (another field needed to store the document
	identifier). This is very simple, performing solution, but treats all text from
	the document equal. Another approach is to treat each element text and each
	element attribute values as units, and index every one as a dedicated document.
	For this position to be beneficial, it requires an element context (its
	containment within document and other element) to be stored, or an identifier
	assigned that can be used to look up the corresponding XML document outside of
	Lucene. </p>
	<h3>Abstracting the indexable document</h3>
	<p>Forgetting for a while about format of the source document, what makes sense
	in Eclipse UA is to be able to retrieve document fragments that can be presented
	on its own or are reusable pieces that can be displayed as part of different
	pages or different places. The fact that the content originated from an XML file
	should be irrelevant. Therefore XML content needs to be abstracted into a
	fragment that will correspond to Lucene document that can be indexed. A sample
	abstract IndexableDocument can be designed with the following methods: </p>
	<p>String getContributor - containing id of index contributor, for examples
	org.eclipse.ui.intro String getName() - containing unique document identifier
	understood by a component, for example URL, XPath, or anything that will be
	needed to retrieve document for displaying Reader getContent() - containing main
	text content for search String getTitle() - containing higher importance text
	for search String getKeywords() - containing optional keywords, synonyms for
	search, present or not in the original document String getRawTitle() -
	containing one line human readable name/title that can be later retrieved from
	the index String getSummary() - containing optional, multiple line human
	readable description of the content String getConstraints() - containing ids of
	the constraints, like OS, or roles that this content applies to </p>
	<p>More methods can be added if necessary, but allowing variable number of
	fields based on XML schema would take away benefits of using common index
	between components. Fields private to component depending on each XML definition
	would practically equal having separate indexes and require different search
	query for each document type, while theoretically all documents would be kept in
	the same Lucene index/file. The benefits and performance of such design are
	questionable. </p>
	<p>Each index participant will register their own IndexableDocument factory
	(parser, text extractor, digester whatever the name) with the index manager.
	Upon requests for the document the factory will produce the document that can
	easily be indexed. Assuming we will not be able to unify contribution format
	across all UA components (old help, intro, and cheat sheets format may need to
	be supported), each UA component may have their own and multiple of factories or
	even allow for pluggable factories for undefined source document format. A
	factory may exist for DITA XML, XHMTL, HTML, PDF or dynamically generated
	content. From the implementation point of view, factories will mainly be
	parsers. If there is a need the parsers can be parameterized and work of the
	schema that is also used for displaying documents, but the challenge is to make
	them fast, and withstand long documents without large memory consumption. </p>
	<h3>Collating and filtering results</h3>
	<p>If separate, each UA component needs to keep track of its own contribution,
	be able to answer whether it requires indexing (either addition or deletion) of
	documents, and produce a list of document IDs to add/remove upon indexing
	triggered by any of components. This will allow for consistent index and
	searches irrespective of component activation by the user.</p>
	<p>Merging of search results from across components occurs automatically, as all
	documents exist in the same Lucene index. However search results will need to be
	passed through each indexing participant for filtering and converting of
	IndexableDocument identifier to a handle to a document that can be presented to
	the user. Each component would be responsible for filtering search results that
	should be hidden from the user. For example enabling activity is a frequent
	operation that should not result in an index change. Filtering based on
	activities should occur post search, on search results. </p>
	<p>Some filtering constraints with well defined set of values, for example OS,
	WS, ARCH can be indexed with minimal overhead. If indexing occurred on the
	client machine, documents not satisfying the constraints could be skipped and
	not indexed at all, but taking advantage of prebuilt indexes requires such
	content to be indexed and filtered on the client. Keeping such content
	identified by indexed constraints will allow automatic filtering by
	complimenting the search with additional boolean query for constraints field.
	</p>
	<p>Filtering based on NL should not occur. Given that UA content is human
	readable it is almost 100% translatable and very small number of documents is
	expected to be common between languages. For this reason index should contain
	text for one locale only as in 3.1 help system.</p>
	<p>Optionally, we can add additional filtering field for the UA component's own
	use. The information there would not be further analyzed, but indexed as is.
	When performing search, each UA component would have a chance of narrowing
	search to its own contributions, and further by providing query to apply to
	additional filtering field. For example, when invoked from a wizard, help system
	search may provide a query to mandate additional filtering field of the results
	to contain task. The information can be partitioned this way into subsets not
	necessary understood by all parts of UA. </p>
	<p>One of the things expected in future versions of help are dynamically
	revealed links to related documents, when they are present, without showing
	broken links all the time. Not trying to support searches for references, it
	does not look necessary to index links or their description. Omitting links from
	indexing will ensure that the target document is found when it exists, and not
	the document that refers to it.</p>
	<h3>Handling prebuilt indexes</h3>
	<p>Since index will be shared among components, producing prebuilt indexes will
	be more challenging. For easiest maintenance and consistency, it is best for the
	Index component to be responsible for generating and merging prebuilt indexes.
	It will require that each component will handle and document factories are able
	to work with documents in the workspace, or other installation of Eclipse.</p>
	<p>It is impossible to come up with a perfect ranking algorithm. The more
	structural differences between documents, the more complicated index is and more
	difficult to design optimal ranking algorithm. Help system search up to 3.1
	suffers from document length having too great an influence on the ranking.
	Shorter documents with N matches rank much higher than long documents with the
	same N number of matches in text. It may be necessary to implement custom Scorer
	for UA searches.</p>