lucene-java-user mailing list archives

From "Andrew Hudson" <>
Subject Re: short documents = help me tweak Similarity??
Date Fri, 06 Apr 2007 05:04:56 GMT
> Also, i don't understand why the encode/decode functions have a range of
> 7x10^9 to 2x10^-9, when it seems to me the most common values are (boosts
> set to 1.0) something between 1.0 and 0.  When would somebody have a monster
> huge value like 7x10^9?  Even with a huge index time boost of 20.0 or
> something, why would the encode/decode need a range as huge as the current
> implementation?

I have often asked myself the same thing. I have simply tried to avoid
depending on the field norms where possible. For instance, if you keep
your own array of how long each field is, you can boost the documents
however you want in your HitCollector by looking up the value in that
array using the docId. That is the approach we have generally taken in
our application.

You can get the number of terms in each field by creating an array of
length maxDoc, iterating over all of the TermPositions for that field,
and remembering the maximum position you saw for each document (plus
one, if positions are counted from zero).

This array is also useful for implementing exact phrase matching.
Suppose someone wants documents that match *exactly* "Nissan Altima".
You would do a phrase search for "Nissan Altima" and then ignore all
the results that do not have exactly two terms in that field. For
example, "Nissan Altima Standard" would match that query, but you
would see in your array that it has 3 terms when you only care about
results that have 2.

To do this you have to implement your own HitCollector and use that
instead of the "Hits" interface. To get an idea of how to do that, you
can look at the HitCollector that the Hits object uses.
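
A rough sketch of the bookkeeping described above, in plain Java with
no Lucene dependency so it can be run on its own. The Posting class,
the sample data, and the method names are all made up for
illustration: Posting stands in for what you would read from
TermPositions, and the phraseHits array stands in for the doc ids your
HitCollector would receive. It assumes positions start at zero and
that there are no position gaps (for example from removed stop words);
with gaps, the maximum position overstates the term count. The
1/sqrt(length) rescoring is just one arbitrary choice of boost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FieldLengthFilter {

    /** Hypothetical stand-in for one (term, document) entry: the positions of a term in one doc. */
    static final class Posting {
        final int docId;
        final int[] positions;

        Posting(int docId, int... positions) {
            this.docId = docId;
            this.positions = positions;
        }
    }

    /** Builds an array of length maxDoc holding (max position + 1) for each document. */
    static int[] fieldLengths(int maxDoc, List<Posting> allPostings) {
        int[] lengths = new int[maxDoc];
        for (Posting p : allPostings) {
            for (int pos : p.positions) {
                lengths[p.docId] = Math.max(lengths[p.docId], pos + 1);
            }
        }
        return lengths;
    }

    /** Keeps only the phrase hits whose field has exactly phraseLength terms. */
    static List<Integer> exactMatches(int[] phraseHits, int[] lengths, int phraseLength) {
        List<Integer> exact = new ArrayList<Integer>();
        for (int docId : phraseHits) {
            if (lengths[docId] == phraseLength) {
                exact.add(docId);
            }
        }
        return exact;
    }

    /** Arbitrary example boost: scale a raw score down by the square root of the field length. */
    static float lengthBoost(float score, int length) {
        return length == 0 ? score : (float) (score / Math.sqrt(length));
    }

    /** Postings for three tiny docs: "nissan altima", "nissan altima standard", "nissan maxima". */
    static List<Posting> samplePostings() {
        return Arrays.asList(
            new Posting(0, 0), new Posting(1, 0), new Posting(2, 0), // nissan
            new Posting(0, 1), new Posting(1, 1),                    // altima
            new Posting(1, 2),                                       // standard
            new Posting(2, 1));                                      // maxima
    }

    public static void main(String[] args) {
        int[] lengths = fieldLengths(3, samplePostings());
        // Docs 0 and 1 both match the phrase "nissan altima"; only doc 0 is an exact match.
        int[] phraseHits = {0, 1};
        System.out.println("lengths = " + Arrays.toString(lengths));
        System.out.println("exact   = " + exactMatches(phraseHits, lengths, 2));
    }
}
```

In a real collector you would do the lengths[docId] lookup inside
collect(doc, score) and either skip the hit or rescale its score
there, rather than filtering a list afterwards.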
