lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3174) Similarity.Stats class for term & collection statistics
Date Mon, 13 Jun 2011 14:23:51 GMT


Robert Muir commented on LUCENE-3174:

> Almost completely removed idf from the Weights – it still lingers in explain().

Right, explain() is a big TODO of a refactoring job. You did the right thing: it's not easily
solved until we refactor it big-time so that any arbitrary Similarity can explain its own
scoring. Not to make any promises, but I think by doing such a thing (letting a Similarity
control how the explaining works), we will make progress towards LUCENE-3118 too: if you customize
the scoring system for your app, you should be able to explain the scores in a way that makes
sense to your app too.

> The DocScorer factory methods now need both the Weight and the Stats; that's the best I could
> do for now.

This sounds like a good step to me! Eventually we want to pass only the Stats to the DocScorer
factory methods, but we have some more work to do before that... such as better handling of the
whole boosting situation and pushing all responsibility for query normalization into Stats.

Once we have done this, I think Weight/Stats will make sense (except for naming), as it will
be the parallel of Scorer/DocScorer: full responsibility for scoring is in the Similarity,
and Weight/Scorer only handle things like seeking to terms, creating docsenums, iterating
postings lists, etc. :)
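
To make the intended split concrete, here is a minimal sketch of it. It is not code from the
flexscoring branch: the names ScoringSplitSketch, docScorer and scoreAll are hypothetical, the
Stats and DocScorer types are simplified stand-ins, and the sqrt(freq) * idf formula is only
an illustrative assumption. The point is that the search-side loop walks postings and
delegates, while all scoring math lives behind the Similarity-side types.

```java
// Hypothetical sketch of the Weight/Scorer vs. Stats/DocScorer split.
// None of these signatures are taken from the LUCENE-3174 patches.
public class ScoringSplitSketch {

    /** Similarity side: per-query statistics (simplified stand-in for Similarity.Stats). */
    static final class Stats {
        final double idf;
        Stats(double idf) { this.idf = idf; }
    }

    /** Similarity side: scores a single document (simplified stand-in for DocScorer). */
    interface DocScorer {
        double score(int doc, int freq);
    }

    /** Factory that needs only the Stats, which is the stated end goal. */
    static DocScorer docScorer(Stats stats) {
        // Illustrative formula only (an assumption): sqrt(tf) * idf.
        return (doc, freq) -> Math.sqrt(freq) * stats.idf;
    }

    /** Search side: iterates a (doc, freq) postings list and delegates all scoring. */
    static double[] scoreAll(int[] docs, int[] freqs, DocScorer scorer) {
        double[] scores = new double[docs.length];
        for (int i = 0; i < docs.length; i++) {
            scores[i] = scorer.score(docs[i], freqs[i]);
        }
        return scores;
    }

    public static void main(String[] args) {
        DocScorer scorer = docScorer(new Stats(2.0));
        double[] scores = scoreAll(new int[] {0, 3}, new int[] {4, 9}, scorer);
        for (double s : scores) {
            System.out.println(s);
        }
    }
}
```

With this shape, swapping in a different ranking method would only change what docScorer
returns; the postings loop would stay untouched.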

> Similarity.Stats class for term & collection statistics
> -------------------------------------------------------
>                 Key: LUCENE-3174
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/search
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>            Priority: Minor
>             Fix For: flexscoring branch
>         Attachments: LUCENE-3174.patch, LUCENE-3174.patch, LUCENE-3174.patch, LUCENE-3174_normalize_boost.patch
> In order to support ranking methods besides TF-IDF, we need to make the statistics they
> need available. These statistics could be computed in computeWeight (soon to become computeStats)
> and stored in a separate object for easy access. Since this object will be used solely by
> subclasses of Similarity, it should be implemented as a static inner class, i.e. Similarity.Stats.
> There are two ways this could be implemented:
> - as a single Similarity.Stats class, reused by all ranking algorithms. In this case,
>   this class would have a member field for every statistic;
> - as a hierarchy of Stats classes, one for each ranking algorithm. Each subclass would
>   define only the statistics needed for the ranking algorithm.
> In the second case, the Stats class in DefaultSimilarity would have a single field, idf,
> while the one in e.g. BM25Similarity would have idf and average field/document length.
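
The second option in the description above (a hierarchy of Stats classes) could look roughly
like the following sketch. All class and method names here (StatsSketch, TfIdfStats,
Bm25Stats, idf) are hypothetical and not taken from the attached patches, and the idf formula
is an illustrative assumption rather than what the branch computes.

```java
// Hypothetical sketch of "one Stats subclass per ranking algorithm".
public class StatsSketch {

    /** Base type: holds nothing, so each ranking method adds only what it needs. */
    static abstract class Stats {}

    /** TF-IDF style ranking needs a single statistic. */
    static final class TfIdfStats extends Stats {
        final double idf;
        TfIdfStats(double idf) { this.idf = idf; }
    }

    /** BM25 style ranking additionally needs the average field length. */
    static final class Bm25Stats extends Stats {
        final double idf;
        final double avgFieldLength;
        Bm25Stats(double idf, double avgFieldLength) {
            this.idf = idf;
            this.avgFieldLength = avgFieldLength;
        }
    }

    /** Illustrative idf (an assumption): log(numDocs / (docFreq + 1)) + 1. */
    static double idf(long docFreq, long numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        TfIdfStats tfidf = new TfIdfStats(idf(9, 1000));
        Bm25Stats bm25 = new Bm25Stats(idf(9, 1000), 42.0);
        System.out.println("tf-idf idf = " + tfidf.idf);
        System.out.println("bm25 avgFieldLength = " + bm25.avgFieldLength);
    }
}
```

The first option would instead collapse both subclasses into one class carrying every field,
which is simpler to pass around but leaves unused fields for most ranking methods.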

This message is automatically generated by JIRA.