lucene-dev mailing list archives

Subject RE: cvs commit: jakarta-lucene TODO.txt
Date Tue, 28 May 2002 21:39:37 GMT
> From: Alex Murzaku
> Second, never heard from Doug whether it is possible
> in theory to implement some other similarity/distance
> function and to plug these instead of the standard
> enhanced tf*idf. I think there was at least one other
> Lucene user interested in this (especially in the case
> of short fields like addresses, single sentences,
> etc.)

Some scoring changes are hard, some are easy.

Relatively easy things:
 - changing per-term factor in score -- currently idf, i.e.,
 - changing factor based on term's freq within document -- currently
 - changing the coordination factor, the boost a hit gets for containing a
large percentage of the query terms.

These all correspond to methods in Similarity.  These could be made
into a TermWeight interface, with a default implementation, and a way to
specify an alternate implementation when building a searcher.
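
To make the proposal concrete, here is a rough sketch of what such an
interface might look like.  Nothing here exists in Lucene: TermWeight,
DefaultTermWeight, BinaryTfTermWeight and SketchSearcher are hypothetical
names, and the formulas in the default implementation are illustrative
assumptions rather than a statement of what the current code computes.

```java
// Sketch only: TermWeight, DefaultTermWeight, BinaryTfTermWeight and
// SketchSearcher are hypothetical names, not existing Lucene classes.
class TermWeightSketch {

    /** The three search-time factors listed above, gathered in one interface. */
    interface TermWeight {
        /** Per-term factor, computed from collection statistics. */
        double idf(int docFreq, int numDocs);

        /** Factor based on the term's frequency within a document. */
        double tf(int freq);

        /** Boost for matching a large share of the query terms. */
        double coord(int overlap, int maxOverlap);
    }

    /** Default implementation; the formulas are illustrative assumptions. */
    static class DefaultTermWeight implements TermWeight {
        public double idf(int docFreq, int numDocs) {
            return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        }

        public double tf(int freq) {
            return Math.sqrt(freq);
        }

        public double coord(int overlap, int maxOverlap) {
            return (double) overlap / maxOverlap;
        }
    }

    /** Alternate for short fields: a term counts once, however often it repeats. */
    static class BinaryTfTermWeight extends DefaultTermWeight {
        @Override
        public double tf(int freq) {
            return freq > 0 ? 1.0 : 0.0;
        }
    }

    /** Stand-in for a searcher that takes the implementation at construction. */
    static class SketchSearcher {
        private final TermWeight weight;

        SketchSearcher() {
            this(new DefaultTermWeight());
        }

        SketchSearcher(TermWeight weight) {
            this.weight = weight;
        }

        TermWeight getTermWeight() {
            return weight;
        }
    }

    public static void main(String[] args) {
        SketchSearcher standard = new SketchSearcher();
        SketchSearcher shortFields = new SketchSearcher(new BinaryTfTermWeight());
        System.out.println(standard.getTermWeight().tf(9));
        System.out.println(shortFields.getTermWeight().tf(9));
    }
}
```

An alternate such as BinaryTfTermWeight is the sort of thing that might suit
short fields like addresses, where repeated occurrences of a term probably
should not raise the score.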

Somewhat harder:
 - changing per-document factor in score -- currently sqrt(docLength).  This
is also a Similarity method, but it is called when the index is created, so
its implementation cannot be changed at search time.

The scoring formula sums products of all these factors.
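
As a rough illustration of the "sum of products" shape, the sketch below
scores a single document.  It assumes the coordination factor multiplies the
whole sum and that the per-document factor arrives as an opaque number
already stored in the index; the class, method and parameter names are made
up for the example, and tf is taken to be a square root purely for
illustration.

```java
// Sketch only: ScoreSketch and its parameters are hypothetical, and the exact
// way the factors combine is an assumption made for illustration.
class ScoreSketch {

    /** Illustrative within-document frequency factor. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /**
     * Scores one document against a query.
     *
     * @param freqs     frequency of each query term in the document (0 if absent)
     * @param idfs      per-term factor for each query term
     * @param docFactor per-document factor, fixed when the index was created
     */
    static double score(int[] freqs, double[] idfs, double docFactor) {
        double sum = 0.0;
        int matched = 0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] > 0) {
                // One product per matching term.
                sum += tf(freqs[i]) * idfs[i] * docFactor;
                matched++;
            }
        }
        // Coordination factor: fraction of the query terms the document contains.
        double coord = freqs.length == 0 ? 0.0 : (double) matched / freqs.length;
        return coord * sum;
    }

    public static void main(String[] args) {
        // Two query terms; the document contains only the first, four times.
        System.out.println(score(new int[] {4, 0}, new double[] {2.0, 3.0}, 0.5));
    }
}
```

Note that docFactor is simply passed in: because it is computed at indexing
time, a search-time plug-in could not alter it without re-indexing.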

Harder-yet things:
 - change the form of the scoring formula itself: a fair amount of code
assumes that scores are sums of products of the above factors.  It would
be challenging to design things both so that the formula can be easily
altered and so that things are efficient.  I think if folks really want to
change the formula fundamentally, they're best off using IndexReader
directly and writing a search algorithm from scratch.

So what in particular are you interested in altering?  Would you be
satisfied with the addition of a TermWeight interface?

