lucene-java-user mailing list archives

From Robert Muir <>
Subject Re: SweetSpotSimilarity
Date Mon, 05 Mar 2012 23:24:09 GMT
On Mon, Mar 5, 2012 at 6:01 PM, Paul Hill <> wrote:
>> I would definitely not suggest using SSS for fields like legal brief text or emails where there is huge
>> variability in the length of the content -- I can't think of any context where a "short" email is
>> definitively better/worse than a "long" email. More traditional TF/IDF seems like it would make more
>> sense there.
> I was coming to a similar conclusion.
>> well ... hopefully the Similarity docs and the docs on Lucene scoring have filled in most of those
>> blanks before you drill down into the specifics of how SSS works. If not, then any
>> improvements you can suggest would certainly be appreciated...
> Thanks for the links.
> The first thing I notice is that what is listed at the top of Similarity is totally changed.
> Great stuff about the object interaction. For example, I didn't understand how the Weight object
> fit in until reading that.
> But I see I got what I asked for. Someone thought describing the object interaction
> was more important than the scoring formula itself. I'll chew on it (but I'm currently using
> the 3.4 code).
> My only thought is that the new stuff seems to be at the expense of the formulas listed
> in the old class overview for Similarity.


What was Similarity in older releases has moved to
TFIDFSimilarity: it extends Similarity and exposes a vector-space API,
with the same formulas in its javadocs.
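
As a rough reminder of what those formulas look like, here is a minimal
plain-Java sketch of the per-term factors used by the default TF/IDF
implementation (tf, idf, length norm). It does not call Lucene at all: the
class and method names are made up for illustration, and it leaves out
coord, queryNorm, boosts, and the lossy one-byte norm encoding, so treat it
as an approximation of the documented formulas, not the actual code.

```java
// Hypothetical stand-alone sketch; not part of Lucene.
public class ClassicTfIdfSketch {

    // Term frequency factor: square root of the raw within-document frequency.
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    // Length normalization: shorter fields get a larger factor.
    // (The real index stores this in a lossy encoded form; ignored here.)
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Simplified contribution of one query term to one document's score:
    // tf * idf^2 * lengthNorm. Boosts, coord and queryNorm are omitted.
    static double termScore(double freq, long docFreq, long numDocs, int numTerms) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a 100-term field, found in 9 of 10 docs.
        System.out.println(termScore(4.0, 9, 10, 100));
    }
}
```

The length norm is the piece that SweetSpotSimilarity reshapes, which is why
it matters so much for fields whose length varies widely.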

The difference is that in 4.0, the idea is to support other scoring
models beyond the vector space model: that's why, if you start looking
at other subclasses of Similarity, you will find more options (e.g.
probabilistic models).

This change is described in CHANGES.txt (below). I hope it's not
confusing: if you have ideas to improve the javadocs and present this
stuff better for migrating users, it would be very helpful.

* LUCENE-2392, LUCENE-3299: Decoupled vector space scoring from
  Query/Weight/Scorer. If you extended Similarity directly before, you should
  extend TFIDFSimilarity instead.  Similarity is now a lower-level API to
  implement other scoring algorithms.  See MIGRATE.txt for more details.

* LUCENE-2959: Added a variety of different relevance ranking systems to Lucene.

  - Added Okapi BM25, Language Models, Divergence from Randomness, and
    Information-Based Models. The models are pluggable, support all of Lucene's
    features (boosts, slops, explanations, etc) and queries (spans, etc).

  - All models default to the same index-time norm encoding as
    DefaultSimilarity, so you can easily try these out/switch back and
    forth/run experiments and comparisons without reindexing. Note: most of
    the models do rely upon index statistics that are new in Lucene 4.0, so
    for existing 3.x indexes it's a good idea to upgrade your index to the
    new format with IndexUpgrader first.

  - Added a new subclass SimilarityBase which provides a simplified API
    for plugging in new ranking algorithms without dealing with all of the
    nuances and implementation details of Lucene.

  - For example, to use BM25 for all fields:
     searcher.setSimilarity(new BM25Similarity());

    If you instead want to apply different similarities (e.g. ones with
    different parameter values or different algorithms entirely) to different
    fields, implement PerFieldSimilarityWrapper with your per-field logic.
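
To give a feel for how one of the non-vector-space models differs, here is a
small plain-Java sketch of the textbook Okapi BM25 per-term score. It is not
Lucene code: the class and method names are hypothetical, k1 = 1.2 and
b = 0.75 are assumed as the usual default parameters, and exact document
lengths are used where a real index would only have the encoded norm.

```java
// Hypothetical stand-alone sketch of the BM25 formula; not part of Lucene.
public class Bm25Sketch {

    static final double K1 = 1.2;  // assumed default: controls term-frequency saturation
    static final double B = 0.75;  // assumed default: controls length normalization strength

    // BM25 idf: ln(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)).
    static double idf(long docFreq, long numDocs) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Per-term score: idf * freq*(k1+1) / (freq + k1*(1 - b + b*docLen/avgDocLen)).
    static double termScore(double freq, long docFreq, long numDocs,
                            double docLen, double avgDocLen) {
        double lengthFactor = 1.0 - B + B * (docLen / avgDocLen);
        double tfPart = freq * (K1 + 1.0) / (freq + K1 * lengthFactor);
        return idf(docFreq, numDocs) * tfPart;
    }

    public static void main(String[] args) {
        // Same term statistics, three document lengths around an average of 100.
        for (double docLen : new double[] {50.0, 100.0, 400.0}) {
            System.out.println(docLen + " -> " + termScore(3.0, 5, 10, docLen, 100.0));
        }
    }
}
```

Unlike the sqrt(freq) factor in the TF/IDF model, the term-frequency part here
saturates: it can never exceed k1 + 1 no matter how often the term repeats.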

