lucene-dev mailing list archives

From "Nadav Har'El" <>
Subject Re: Similarity.lengthNorm and positionIncrement=0
Date Mon, 13 Oct 2008 07:58:52 GMT
On Sun, Oct 12, 2008, Michael McCandless wrote about "Re: Similarity.lengthNorm and positionIncrement=0":
> I agree we should make this possible.  A field should not be  
> "penalized" just because many of its terms had synonyms.

I guess it won't do any harm to make this an option, but we need to do some
careful thinking before making this the default, or even encouraging it.

If we recall the rationale of length normalization, it is not to "penalize"
long documents, in the sense that users are less likely to want to see long
documents. Rather, the idea is that a long document contains more words -
more unique words and more repetitions of each word - so long documents are
more likely to match any query, and more likely to have higher scores for
each query. If you don't do length normalization, (almost) no matter what
search you perform, you'll get the longest documents back, rather than the
really best-matching documents. This is why length normalization is necessary.
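
To make the effect concrete, here is a minimal sketch in plain Java (no
Lucene classes). It assumes the square-root forms used by DefaultSimilarity,
tf = sqrt(freq) and lengthNorm = 1/sqrt(numTerms). The class and method
names, and the two example documents, are made up for illustration only.

```java
public class LengthNormSketch {
    // Assumed formula: 1/sqrt(numTerms), the form DefaultSimilarity uses.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Score contribution of one query term without any length normalization.
    static double rawScore(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // The same contribution, scaled down for longer fields.
    static double normalizedScore(int termFreq, int numTerms) {
        return rawScore(termFreq) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Hypothetical documents: a 10-term field with 2 hits,
        // and a 1000-term field with 5 hits.
        System.out.println("raw:        short=" + rawScore(2)
                + " long=" + rawScore(5));
        System.out.println("normalized: short=" + normalizedScore(2, 10)
                + " long=" + normalizedScore(5, 1000));
    }
}
```

Without the norm the long document ranks first simply because it has more
occurrences; with it, the short document, where the term makes up a much
larger share of the text, comes out ahead.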

Now, if we do synonym expansion during indexing, the document *really*
becomes longer - it now (possibly) contains more unique words and more
repetitions thereof. So it actually makes sense, I think, to count these
synonyms as well, rather than try to avoid counting them.

But you're right - if we're not talking about real synonyms, but rather
variants which will *never* be used in the same query (ASCII vs. accented
in your case), it does make sense not to count them twice, so it might
indeed be useful to have this proposed behavior as an option.
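
Roughly, the option could look like the following sketch. It is plain Java
rather than Lucene code: the Tok class, the discountOverlaps flag and the
sample tokens are all hypothetical, and the norm is again assumed to be
1/sqrt(numTerms). Tokens with positionIncrement=0 are simply left out of the
count that feeds the norm.

```java
import java.util.Arrays;
import java.util.List;

public class OverlapCountSketch {
    // Hypothetical stand-in for an analyzed token: text plus position increment.
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Counts the tokens that feed the length norm. When discountOverlaps is
    // true, tokens stacked on the previous position (increment 0) are skipped.
    static int countTokens(List<Tok> tokens, boolean discountOverlaps) {
        int n = 0;
        for (Tok t : tokens) {
            if (discountOverlaps && t.positionIncrement == 0) {
                continue;
            }
            n++;
        }
        return n;
    }

    // Assumed formula: 1/sqrt(numTerms).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // "caf\u00e9 au lait", with an ASCII variant stacked on the accented form.
    static List<Tok> sample() {
        return Arrays.asList(
                new Tok("caf\u00e9", 1),
                new Tok("cafe", 0),
                new Tok("au", 1),
                new Tok("lait", 1));
    }

    public static void main(String[] args) {
        int all = countTokens(sample(), false);
        int distinctPositions = countTokens(sample(), true);
        System.out.println("counting all tokens:  n=" + all
                + " norm=" + lengthNorm(all));
        System.out.println("skipping overlaps:    n=" + distinctPositions
                + " norm=" + lengthNorm(distinctPositions));
    }
}
```

With the flag off, the field is treated as four terms long; with it on, as
three, so the ASCII variant no longer lowers the norm. For real synonyms the
argument above still applies, which is why this is a flag and not the default.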

Anyway, this is just my opinion (not backed by any hard research or
experimentation), so it might be wrong.

Nadav Har'El                        |      Monday, Oct 13 2008, 14 Tishri 5769
IBM Haifa Research Lab              |-----------------------------------------
                                    |Windows-2000/Professional isn't.           |
