lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Baby steps towards making Lucene's scoring more flexible...
Date Fri, 26 Feb 2010 17:50:44 GMT
In thinking about, and discussing with Robert, how to allow Lucene to
support other scoring models (eg lnu.ltc, BM25), I think a
relatively contained set of changes can give us a solid step forward.
Something like this:

  * Store additional per-doc stats in the index, eg in a custom
    posting list, including the field's length in tokens, avg tf, and
    boost (boost can be stored efficiently, ie only when it differs
    from the default).  Do not compute or store norms in the
    index.  Merging would just concatenate these values (removing
    deleted docs).

  * Change IR so on open it generates norms dynamically, ie by walking
    the stats, computing avgs (eg avg field length in tokens), and
    computing the final per-field boost, casting to a 1-byte quantized
    float.  We may want to store aggregates in eg SegmentInfo to save
    the extra pass on IR open...

  * Change Similarity to allow field-specific Similarity (I think we
    have an issue open for this already).  I think, also, lengthNorm
    (which would no longer be invoked during indexing) would no
    longer be used.
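
To make the first two bullets concrete, here is a rough sketch of what
merging the per-doc stats and computing an aggregate could look like.
Everything in it is an assumption for illustration: FieldStats,
StatsMergeSketch, and the use of a BitSet for deleted docs are
hypothetical names and choices, not existing Lucene classes or the
actual on-disk format.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical per-doc, per-field stats record (not an existing Lucene class). */
final class FieldStats {
    final int lengthInTokens; // field length in tokens
    final float avgTf;        // average term frequency within the field
    final float boost;        // field boost; 1.0f is assumed to be the default

    FieldStats(int lengthInTokens, float avgTf, float boost) {
        this.lengthInTokens = lengthInTokens;
        this.avgTf = avgTf;
        this.boost = boost;
    }
}

final class StatsMergeSketch {
    /** Concatenates the stats of several segments, skipping deleted docs. */
    static List<FieldStats> merge(List<List<FieldStats>> segments, List<BitSet> deleted) {
        List<FieldStats> merged = new ArrayList<>();
        for (int seg = 0; seg < segments.size(); seg++) {
            List<FieldStats> docs = segments.get(seg);
            BitSet dels = deleted.get(seg); // a set bit marks a deleted doc
            for (int doc = 0; doc < docs.size(); doc++) {
                if (!dels.get(doc)) {
                    merged.add(docs.get(doc));
                }
            }
        }
        return merged;
    }

    /** Average field length in tokens; the kind of aggregate computed on reader open. */
    static double averageLength(List<FieldStats> docs) {
        if (docs.isEmpty()) {
            return 0.0;
        }
        long total = 0;
        for (FieldStats s : docs) {
            total += s.lengthInTokens;
        }
        return (double) total / docs.size();
    }
}
```

Since merging only copies values, no norm ever has to be recomputed at
merge time; an aggregate like averageLength is the sort of value that
could be cached per segment to avoid the extra pass on open.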

I think we'd make the class that computes norms from these per-doc
stats on IR open pluggable.  And, someday, we could make what stats are
gathered/stored during indexing pluggable, but for starters I think we
should simply support the field length in tokens and avg tf per field.
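
As a rough illustration of the pluggable piece, the sketch below shows
one possible shape for it.  All names here (DocStats, NormsComputer,
DynamicNormsSketch) are hypothetical, the pivoted length formula with
slope 0.25 is just an assumed example, and the linear 1-byte
quantization is a simple stand-in, not the byte encoding Lucene uses
for norms today.

```java
/** Hypothetical per-doc stats for one field (illustrative, not Lucene API). */
final class DocStats {
    final int lengthInTokens;
    final float avgTf;
    final float boost;

    DocStats(int lengthInTokens, float avgTf, float boost) {
        this.lengthInTokens = lengthInTokens;
        this.avgTf = avgTf;
        this.boost = boost;
    }
}

/** Hypothetical plug-in point: maps one doc's stats to a float norm. */
interface NormsComputer {
    float norm(DocStats doc, float avgLength);
}

final class DynamicNormsSketch {
    /** Assumed upper bound for the stand-in linear quantization. */
    static final float MAX_NORM = 4.0f;

    /** Example plug-in: pivoted length normalization with an assumed slope of 0.25. */
    static final NormsComputer PIVOTED = (doc, avgLength) -> {
        if (avgLength <= 0f) {
            return doc.boost; // no length information to normalize against
        }
        return doc.boost / (0.75f + 0.25f * doc.lengthInTokens / avgLength);
    };

    /** What a reader might do on open: one pass for the average, one for the norms. */
    static byte[] computeNorms(DocStats[] docs, NormsComputer computer) {
        long total = 0;
        for (DocStats d : docs) {
            total += d.lengthInTokens;
        }
        float avgLength = docs.length == 0 ? 0f : (float) total / docs.length;
        byte[] norms = new byte[docs.length];
        for (int i = 0; i < docs.length; i++) {
            norms[i] = quantize(computer.norm(docs[i], avgLength));
        }
        return norms;
    }

    /** Clamps to [0, MAX_NORM] and maps linearly onto the 256 byte values. */
    static byte quantize(float norm) {
        float clamped = Math.max(0f, Math.min(norm, MAX_NORM));
        return (byte) Math.round(clamped / MAX_NORM * 255f);
    }

    /** Inverse of quantize, up to rounding error. */
    static float decode(byte b) {
        return (b & 0xFF) / 255f * MAX_NORM;
    }
}
```

Swapping in a different model would then mean passing a different
NormsComputer to computeNorms, with no change to what is stored in
the index.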

Thoughts?

Mike