| <!doctype html public "-//w3c//dtd html 4.0 transitional//en"> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <html> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> |
| </head> |
| <body> |
| This package contains the various ranking models that can be used in Lucene. The |
| abstract class {@link org.apache.lucene.search.similarities.Similarity} serves |
| as the base for ranking functions. For searching, users can employ the models |
| already implemented or create their own by extending one of the classes in this |
| package. |
| |
| <h2>Table Of Contents</h2> |
| <p> |
| <ol> |
| <li><a href="#sims">Summary of the Ranking Methods</a></li> |
| <li><a href="#changingSimilarity">Changing the Similarity</a></li> |
| </ol> |
| </p> |
| |
| |
| <a name="sims"></a> |
| <h2>Summary of the Ranking Methods</h2> |
| |
| <p>{@link org.apache.lucene.search.similarities.DefaultSimilarity} is the original Lucene |
| scoring function. It is based on a highly optimized |
| <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model</a>. For more |
| information, see {@link org.apache.lucene.search.similarities.TFIDFSimilarity}.</p> |
| |
| <p>{@link org.apache.lucene.search.similarities.BM25Similarity} is an optimized |
| implementation of the successful Okapi BM25 model.</p> |
| |
| <p>{@link org.apache.lucene.search.similarities.SimilarityBase} provides a basic |
| implementation of the Similarity contract and exposes a highly simplified |
| interface, which makes it an ideal starting point for new ranking functions. |
| Lucene ships the following methods built on |
| {@link org.apache.lucene.search.similarities.SimilarityBase}: |
| |
| <a name="framework"></a> |
| <ul> |
| <li>Amati and van Rijsbergen's {@linkplain org.apache.lucene.search.similarities.DFRSimilarity DFR} framework;</li> |
| <li>Clinchant and Gaussier's {@linkplain org.apache.lucene.search.similarities.IBSimilarity Information-based models} |
| for IR;</li> |
| <li>The implementation of two {@linkplain org.apache.lucene.search.similarities.LMSimilarity language models} from |
| Zhai and Lafferty's paper.</li> |
| </ul> |
| |
| Since {@link org.apache.lucene.search.similarities.SimilarityBase} is not |
| optimized to the same extent as |
| {@link org.apache.lucene.search.similarities.DefaultSimilarity} and |
| {@link org.apache.lucene.search.similarities.BM25Similarity}, a difference in |
| performance is to be expected when using the methods listed above. However, |
| optimizations can always be implemented in subclasses; see |
| <a href="#changingSimilarity">below</a>.</p> |
| |
| <a name="changingSimilarity"></a> |
| <h2>Changing the Similarity</h2> |
| |
| <p>Chances are the available Similarities are sufficient for all |
| your searching needs. |
| However, in some applications it may be necessary to customize your <a |
| href="Similarity.html">Similarity</a> implementation. For instance, some |
| applications do not need to |
| distinguish between shorter and longer documents (see <a |
| href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">a "fair" similarity</a>).</p> |
| |
| <p>To change the {@link org.apache.lucene.search.similarities.Similarity}, one must do so for both indexing and |
| searching, and the change must be made before |
| either of these actions takes place. Although nothing prevents you from changing the Similarity mid-stream, |
| the resulting behavior is not well-defined. |
| </p> |
| |
| <p>To make this change, implement your own {@link org.apache.lucene.search.similarities.Similarity} (most |
| likely by subclassing an existing implementation, be it |
| {@link org.apache.lucene.search.similarities.DefaultSimilarity} or a descendant of |
| {@link org.apache.lucene.search.similarities.SimilarityBase}), and |
| then register the new class by calling |
| {@link org.apache.lucene.index.IndexWriterConfig#setSimilarity(Similarity)} |
| before indexing and |
| {@link org.apache.lucene.search.IndexSearcher#setSimilarity(Similarity)} |
| before searching. |
| </p> |
| |
| <h3>Extending {@linkplain org.apache.lucene.search.similarities.SimilarityBase}</h3> |
| <p> |
| The quickest way to implement a new ranking method is to extend |
| {@link org.apache.lucene.search.similarities.SimilarityBase}, which provides |
| basic implementations of the low-level parts of the Similarity contract. Subclasses are only required to |
| implement the {@link org.apache.lucene.search.similarities.SimilarityBase#score(BasicStats, float, float)} |
| and {@link org.apache.lucene.search.similarities.SimilarityBase#toString()} |
| methods.</p> |
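<p>As a rough illustration of how little such a subclass has to provide, the
standalone sketch below mimics the shape of the two required methods. It does
not depend on Lucene: <code>SimpleScoreSketch</code>, its nested
<code>Stats</code> class and the formula itself are hypothetical stand-ins
chosen for this example, not part of the library.</p>

```java
/** Standalone sketch; does not use Lucene. All names here are hypothetical. */
public class SimpleScoreSketch {

  /** Hypothetical stand-in for the collection statistics passed to a scorer. */
  static final class Stats {
    final long numberOfDocuments; // documents in the collection
    final long docFreq;           // documents that contain the term
    final float avgFieldLength;   // average field length, in terms

    Stats(long numberOfDocuments, long docFreq, float avgFieldLength) {
      this.numberOfDocuments = numberOfDocuments;
      this.docFreq = docFreq;
      this.avgFieldLength = avgFieldLength;
    }
  }

  /** Toy formula: length-normalized term frequency times a smoothed idf. */
  float score(Stats stats, float freq, float docLen) {
    float tfn = freq * stats.avgFieldLength / docLen;
    float idf = (float) Math.log(1.0 + (double) stats.numberOfDocuments / stats.docFreq);
    return tfn * idf;
  }

  /** A descriptive name for the ranking method. */
  @Override
  public String toString() {
    return "SimpleScoreSketch";
  }

  public static void main(String[] args) {
    SimpleScoreSketch sim = new SimpleScoreSketch();
    Stats stats = new Stats(1000, 10, 50f);
    // The same term frequency scores lower in a longer document.
    System.out.println(sim + ": " + sim.score(stats, 2f, 50f));
    System.out.println(sim + ": " + sim.score(stats, 2f, 100f));
  }
}
```

<p>In a real subclass the statistics would arrive as a <code>BasicStats</code>
instance, and everything else (norms, weights, explanations) would be handled
by the base class.</p>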
| |
| <p>Another option is to extend one of the <a href="#framework">frameworks</a> |
| based on {@link org.apache.lucene.search.similarities.SimilarityBase}. These |
| Similarities are implemented modularly, e.g. |
| {@link org.apache.lucene.search.similarities.DFRSimilarity} delegates |
| computation of the three parts of its formula to the classes |
| {@link org.apache.lucene.search.similarities.BasicModel}, |
| {@link org.apache.lucene.search.similarities.AfterEffect} and |
| {@link org.apache.lucene.search.similarities.Normalization}. Instead of |
| subclassing the Similarity, one can simply introduce a new basic model and tell |
| {@link org.apache.lucene.search.similarities.DFRSimilarity} to use it.</p> |
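<p>The modular design described above can be pictured with the following
standalone sketch. It does not use Lucene: the three nested interfaces and the
toy formulas are hypothetical stand-ins for the basic model, after effect and
normalization components, and are only meant to show how a new basic model can
be plugged in without subclassing the scorer.</p>

```java
/** Standalone sketch of a three-part scorer; all names are hypothetical. */
public class ModularScoreSketch {

  /** Stand-in for the basic model component. */
  interface BasicModelPart { float score(float tfn); }

  /** Stand-in for the after effect component. */
  interface AfterEffectPart { float score(float tfn); }

  /** Stand-in for the term frequency normalization component. */
  interface NormalizationPart { float tfn(float freq, float docLen, float avgLen); }

  private final BasicModelPart basicModel;
  private final AfterEffectPart afterEffect;
  private final NormalizationPart normalization;

  ModularScoreSketch(BasicModelPart basicModel, AfterEffectPart afterEffect,
                     NormalizationPart normalization) {
    this.basicModel = basicModel;
    this.afterEffect = afterEffect;
    this.normalization = normalization;
  }

  /** Normalizes the frequency first, then multiplies the two model parts. */
  float score(float freq, float docLen, float avgLen) {
    float tfn = normalization.tfn(freq, docLen, avgLen);
    return basicModel.score(tfn) * afterEffect.score(tfn);
  }

  public static void main(String[] args) {
    AfterEffectPart afterEffect = tfn -> 1f / (tfn + 1f);
    NormalizationPart normalization = (freq, docLen, avgLen) -> freq * avgLen / docLen;

    // Two scorers that differ only in the basic model that is plugged in.
    ModularScoreSketch logModel =
        new ModularScoreSketch(tfn -> (float) Math.log1p(tfn), afterEffect, normalization);
    ModularScoreSketch linearModel =
        new ModularScoreSketch(tfn -> tfn, afterEffect, normalization);

    System.out.println(logModel.score(3f, 10f, 10f));
    System.out.println(linearModel.score(3f, 10f, 10f));
  }
}
```

<p>Swapping the first constructor argument changes the ranking function while
the scoring code itself stays untouched, which is the appeal of the modular
approach.</p>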
| |
| <h3>Changing {@linkplain org.apache.lucene.search.similarities.DefaultSimilarity}</h3> |
| <p> |
| If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at <a |
| href="http://www.gossamer-threads.com/lists/lucene/java-user/39125">Overriding Similarity</a>. |
| In summary, here are a few use cases: |
| <ol> |
| <li><p>The <code>SweetSpotSimilarity</code> in |
| <code>org.apache.lucene.misc</code> gives small |
| score increases while the frequency grows by a small amount, |
| and greater increases once you hit the "sweet spot", i.e. the range where |
| you think the frequency of terms is more significant.</p></li> |
| <li><p>Overriding tf — In some applications, it doesn't matter what the score of a document is as long as a |
| matching term occurs. In these |
| cases people have overridden Similarity to return 1 from the tf() method.</p></li> |
| <li><p>Changing Length Normalization — By overriding |
| {@link org.apache.lucene.search.similarities.Similarity#computeNorm(FieldInvertState state)}, |
| it is possible to discount how the length of a field contributes |
| to a score. In {@link org.apache.lucene.search.similarities.DefaultSimilarity}, |
| lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be |
| 1 / (numTerms in field), all fields will be treated |
| <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">"fairly"</a>.</p></li> |
| </ol> |
| In general, Chris Hostetter sums it up best in saying (from <a |
| href="http://www.gossamer-threads.com/lists/lucene/java-user/39125#39125">the Lucene users' mailing list</a>): |
| <blockquote>[One would override the Similarity in] ... any situation where you know more about your data than just |
| that it's "text" is a situation where it *might* make sense to override your |
| Similarity method.</blockquote> |
| </p> |
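<p>To make the second and third use cases more concrete, the standalone sketch
below compares the formulas numerically. It does not use Lucene, and the class
and method names are hypothetical. The two length norm variants follow the
formulas quoted above; the square-root term frequency is assumed here to be the
default behavior that an overriding <code>tf()</code> would replace.</p>

```java
/** Standalone sketch of the tf and length norm variants; names are hypothetical. */
public class NormSketch {

  /** Assumed default: term frequency contributes its square root. */
  static float sqrtTf(float freq) {
    return (float) Math.sqrt(freq);
  }

  /** Overridden tf: any occurrence counts the same, however often it repeats. */
  static float constantTf(float freq) {
    return freq > 0f ? 1f : 0f;
  }

  /** lengthNorm = 1 / (numTerms in field)^0.5 */
  static float sqrtLengthNorm(int numTerms) {
    return (float) (1.0 / Math.sqrt(numTerms));
  }

  /** lengthNorm = 1 / (numTerms in field) */
  static float linearLengthNorm(int numTerms) {
    return 1f / numTerms;
  }

  public static void main(String[] args) {
    for (int numTerms : new int[] {4, 16, 64}) {
      System.out.println(numTerms + " terms: sqrt norm = " + sqrtLengthNorm(numTerms)
          + ", linear norm = " + linearLengthNorm(numTerms));
    }
    System.out.println("tf(9): sqrt = " + sqrtTf(9f) + ", constant = " + constantTf(9f));
  }
}
```

<p>The sketch leaves out everything else a real
<code>computeNorm</code> implementation has to deal with, such as boosts and the
way norms are stored in the index.</p>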
| |
| </body> |
| </html> |