lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "John Kleven" <>
Subject Re: short documents = help me tweak Similarity??
Date Fri, 06 Apr 2007 04:13:30 GMT
Thank you kindly for the responses.

This was the solution that I dreamed up initially as well (overriding
lengthNorm) and making the returned values for small numTerms values (e.g. 3
and 4) more discrete.

So I did that in multiple ways, and I ran into a different problem.  If
lengthNorm returns something like this...
numTerms    return from lengthNorm()
1                  2.4
2                  2.0
3                  1.6
4                  1.2
5                  0.8
>5                1/sqrt(numTerms) wind up with a weird situation.  What happens is that for a document
like "Nissan Altima SE Sports Package Sunroof CD 2003" -- imagine someone
searching for that exact document with the exact search string (i.e, they
literally search "Nissan Altima SE Sports Package Sunroof CD 2003") then the
*wrong* hits bubble to the top, e.g., "Nissan Altima Sports Package" will be
the #1 hit even though there was an exact document matching every term.

So then i tried a bunch of different lengthNorm implementations (because
obviously the example lengthNorm above is drastic) ... some drastic (ala the
example above) and some less drastic but still hopefully allowing a big
enough lengthNorm value so that encode/decode would still see a difference
in search length.

And i've yet to be able to find a happy medium that works.  I know that i'm
doing something stupid, or not understanding the scoring algorithm enough (
i.e., there's another part of the similarity class that should take into
account when you have lots of search terms and corresponding matches) but I
just couldn't find a way to make it happen.

This is why i was wondering if could just overload encode/decode in both my
indexer and searcher, and thereby just avoid this whole thing entirely.  But
apparently that isn't a solution either?

Also, i don't understand why the encode/decode functions have a range of
7x10^9 to 2x10^-9, when it seems to me the most common values are (boosts
set to 1.0) something between 1.0 and 0.  When would somebody have a monster
huge value like 7x10^9?  Even with a huge index time boost of 20.0 or
something, why would the encode/decode need a range as huge as the current

I know I am missing something.  Thanks so much for the reponses, I've been
playing with this for so long I knew it was time to post a message and get
to the bottom of this.  Thank u again -

On 4/5/07, Chris Hostetter <> wrote:
> : The problem comes when your float value is encoded into that 8 bit
> : field norm, the 3 length and 4 length both become the same 8 bit
> : value.  Call Similarity.encodeNorm on the values you calculate for the
> : different numbers of terms and make sure they return different byte
> : values.
> bingo.  You can't change encodeNorm and decodeNorm (because they are
> static -- they need to work at query time regardless of what Similarity
> was used at index time) but you can change the function of your length
> norm to make small deviations in short lengths significant enough to
> register even with the encoding.
> -Hoss
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message