{@link org.apache.lucene.search.similarities.DefaultSimilarity} is the original Lucene scoring function. It is based on a highly optimized Vector Space Model. For more information, see {@link org.apache.lucene.search.similarities.TFIDFSimilarity}.
{@link org.apache.lucene.search.similarities.BM25Similarity} is an optimized implementation of the successful Okapi BM25 model.
{@link org.apache.lucene.search.similarities.SimilarityBase} provides a basic implementation of the Similarity contract and exposes a highly simplified interface, which makes it an ideal starting point for new ranking functions. Lucene ships the following methods built on {@link org.apache.lucene.search.similarities.SimilarityBase}:
Chances are the available Similarities are sufficient for all your searching needs. However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to distinguish between shorter and longer documents (see a "fair" similarity).
To change {@link org.apache.lucene.search.similarities.Similarity}, one must do so for both indexing and searching, and the change must happen before either of these actions takes place. Although in theory nothing stops you from changing it mid-stream, the resulting behavior is not well-defined.
To make this change, implement your own {@link org.apache.lucene.search.similarities.Similarity} (likely you'll want to simply subclass an existing method, be it {@link org.apache.lucene.search.similarities.DefaultSimilarity} or a descendant of {@link org.apache.lucene.search.similarities.SimilarityBase}), and then register the new class by calling {@link org.apache.lucene.index.IndexWriterConfig#setSimilarity(Similarity)} before indexing and {@link org.apache.lucene.search.IndexSearcher#setSimilarity(Similarity)} before searching.
The easiest way to quickly implement a new ranking method is to extend {@link org.apache.lucene.search.similarities.SimilarityBase}, which provides the basic low-level implementations. Subclasses are only required to implement the {@link org.apache.lucene.search.similarities.SimilarityBase#score(BasicStats, float, float)} and {@link org.apache.lucene.search.similarities.SimilarityBase#toString()} methods.
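As a rough illustration of how small such a subclass can be, the sketch below mirrors the two-method contract with stand-in types so that it compiles on its own. SketchStats, SketchSimilarityBase and RelativeTfIdfSketch are hypothetical names that are not part of Lucene, and the scoring formula is only an example, not one of the shipped models.

```java
/** Hypothetical stand-in for per-term collection statistics (not Lucene's BasicStats). */
final class SketchStats {
    final long numberOfDocuments; // documents in the collection
    final long docFreq;           // documents containing the term

    SketchStats(long numberOfDocuments, long docFreq) {
        this.numberOfDocuments = numberOfDocuments;
        this.docFreq = docFreq;
    }
}

/** Hypothetical stand-in for the simplified contract: one scoring method plus toString(). */
abstract class SketchSimilarityBase {
    /** Scores one document for one term, given the term frequency and the document length. */
    protected abstract float score(SketchStats stats, float freq, float docLen);

    @Override
    public abstract String toString();
}

/** Illustrative ranking function: relative term frequency times a smoothed idf. */
final class RelativeTfIdfSketch extends SketchSimilarityBase {
    @Override
    protected float score(SketchStats stats, float freq, float docLen) {
        // Rarer terms (small docFreq) get a larger idf weight.
        double idf = Math.log(1.0 + (double) stats.numberOfDocuments / stats.docFreq);
        // Dividing by docLen keeps long documents from winning on raw counts alone.
        return (float) ((freq / docLen) * idf);
    }

    @Override
    public String toString() {
        return "RelativeTfIdfSketch";
    }
}
```

With the real base class, the same two overrides would be written against {@link org.apache.lucene.search.similarities.SimilarityBase} and its statistics object instead of the stand-ins used here.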
Another option is to extend one of the frameworks based on {@link org.apache.lucene.search.similarities.SimilarityBase}. These Similarities are implemented modularly, e.g. {@link org.apache.lucene.search.similarities.DFRSimilarity} delegates computation of the three parts of its formula to the classes {@link org.apache.lucene.search.similarities.BasicModel}, {@link org.apache.lucene.search.similarities.AfterEffect} and {@link org.apache.lucene.search.similarities.Normalization}. Instead of subclassing the Similarity, one can simply introduce a new basic model and tell {@link org.apache.lucene.search.similarities.DFRSimilarity} to use it.
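The delegation idea can be sketched without Lucene on the classpath. In the example below, SketchBasicModel, SketchAfterEffect, SketchNormalization and SketchDfr are hypothetical stand-ins for the real classes, and the way the three parts are combined (a normalized term frequency fed into a product of the other two) is a simplifying assumption made for illustration.

```java
/** Hypothetical stand-in for a pluggable basic model. */
interface SketchBasicModel {
    double score(double tfn);
}

/** Hypothetical stand-in for a pluggable after effect. */
interface SketchAfterEffect {
    double score(double tfn);
}

/** Hypothetical stand-in for a pluggable term-frequency normalization. */
interface SketchNormalization {
    double tfn(double freq, double docLen, double avgLen);
}

/** Combines three interchangeable parts; swapping one does not require a new subclass. */
final class SketchDfr {
    private final SketchBasicModel basicModel;
    private final SketchAfterEffect afterEffect;
    private final SketchNormalization normalization;

    SketchDfr(SketchBasicModel basicModel,
              SketchAfterEffect afterEffect,
              SketchNormalization normalization) {
        this.basicModel = basicModel;
        this.afterEffect = afterEffect;
        this.normalization = normalization;
    }

    double score(double freq, double docLen, double avgLen) {
        // Normalize the raw frequency first, then let the other two parts score it.
        double tfn = normalization.tfn(freq, docLen, avgLen);
        return basicModel.score(tfn) * afterEffect.score(tfn);
    }

    public static void main(String[] args) {
        // A new "basic model" is just another implementation passed to the constructor.
        SketchDfr dfr = new SketchDfr(
                tfn -> Math.log1p(tfn),                        // custom basic model
                tfn -> 1.0 / (tfn + 1.0),                      // after effect
                (freq, docLen, avgLen) -> freq * avgLen / docLen); // normalization
        System.out.println(dfr.score(2, 4, 8));
    }
}
```

The point of the design is that a new model only has to implement one small interface; the combining class stays untouched.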
If you are interested in use cases for changing your similarity, see the "Overriding Similarity" thread on the Lucene users' mailing list. In summary, here are a few use cases:
The SweetSpotSimilarity in org.apache.lucene.misc gives small increases as the frequency increases a small amount and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.
Overriding tf — In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the tf() method.
Changing Length Normalization — By overriding {@link org.apache.lucene.search.similarities.Similarity#computeNorm(FieldInvertState state)}, it is possible to discount how the length of a field contributes to a score. In {@link org.apache.lucene.search.similarities.DefaultSimilarity}, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated "fairly".
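To make the last two use cases concrete, here is a small worked sketch in plain Java. It only evaluates the formulas quoted above and does not call Lucene; the class name LengthNormSketch and its method names are hypothetical.

```java
/** Worked arithmetic for the tf and length-normalization use cases (not Lucene code). */
final class LengthNormSketch {

    /** The default-style norm quoted above: 1 / (numTerms in field)^0.5. */
    public static double sqrtLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** The alternative quoted above: 1 / (numTerms in field). */
    public static double linearLengthNorm(int numTerms) {
        return 1.0 / numTerms;
    }

    /** A tf that ignores how often the term occurs and always returns 1. */
    public static float constantTf(float freq) {
        return 1.0f;
    }

    public static void main(String[] args) {
        // Compare how strongly each norm penalizes a short and a long field.
        for (int numTerms : new int[] {4, 100}) {
            System.out.println("numTerms=" + numTerms
                    + " sqrtNorm=" + sqrtLengthNorm(numTerms)
                    + " linearNorm=" + linearLengthNorm(numTerms));
        }
        // With a constant tf, one occurrence and fifty occurrences score the same.
        System.out.println(constantTf(1f) == constantTf(50f));
    }
}
```

For a 4-term field the two norms are 0.5 and 0.25; for a 100-term field they are 0.1 and 0.01, so the second formula discounts long fields much more steeply.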
[One would override the Similarity in] ... any situation where you know more about your data than just that it's "text" is a situation where it *might* make sense to override your Similarity method.