lucene-dev mailing list archives

From Michael McCandless <>
Subject Re: Similarity.lengthNorm and positionIncrement=0
Date Mon, 13 Oct 2008 08:51:19 GMT

OK, this and Andrzej's logic make sense -- let's add it as an option,
but leave the default as the current approach of counting all tokens
towards the length norm.
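
For concreteness, here is a rough sketch of what such an option could
look like. This is not actual Lucene code: the class name, the method
names, and the discountOverlaps flag are all made up for illustration,
and the token stream is reduced to an array of position increments. It
assumes the norm is 1/sqrt(numTokens), as DefaultSimilarity computes it.

```java
public class LengthNormSketch {

    /**
     * Counts the tokens that contribute to the length norm.
     * positionIncrements[i] is the position increment of the i-th token.
     * When discountOverlaps (a hypothetical flag) is true, tokens stacked
     * on the previous position (increment 0) are not counted.
     */
    public static int countTokens(int[] positionIncrements, boolean discountOverlaps) {
        int count = 0;
        for (int increment : positionIncrements) {
            if (discountOverlaps && increment == 0) {
                continue; // e.g. an ASCII variant injected next to an accented term
            }
            count++;
        }
        return count;
    }

    /** Length norm in the style of DefaultSimilarity: 1 / sqrt(numTokens). */
    public static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        // Four tokens, the second of which is stacked on the first (increment 0).
        int[] increments = {1, 0, 1, 1};

        // Default behavior: every token counts towards the norm.
        System.out.println(lengthNorm(countTokens(increments, false)));

        // Optional behavior: the stacked token is ignored, so the norm is higher.
        System.out.println(lengthNorm(countTokens(increments, true)));
    }
}
```

With the flag off, the field above is treated as four tokens long; with
it on, as three, so the injected variant no longer lowers the norm.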


Nadav Har'El wrote:

> On Sun, Oct 12, 2008, Michael McCandless wrote about
> "Re: Similarity.lengthNorm and positionIncrement=0":
>> I agree we should make this possible.  A field should not be
>> "penalized" just because many of its terms had synonyms.
> I guess it won't do any harm to make this an option, but we need to
> do some careful thinking before making this the default, or even
> encouraging it.
> If we recall the rationale of length normalization, it is not to
> "penalize" long documents, in the sense that users are less likely to
> want to see long documents. Rather, the idea is that a long document
> contains more words - more unique words and more repetitions of each
> word - so long documents are more likely to match any query, and more
> likely to have higher scores for each query. If you don't do length
> normalization, (almost) no matter what search you perform, you'll get
> the longest documents back, rather than the really best-matching
> documents. This is why length normalization is necessary.
> Now, if we do synonym expansion during indexing, the document *really*
> becomes longer - it now (possibly) contains more unique words and more
> repetitions thereof. So it actually makes sense, I think, to also
> count these synonyms, and not try to avoid it.
> But you're right - if we're not talking about real synonyms, but
> rather variants which will *never* be used in the same query (ASCII
> vs. accented in your case), it does make sense not to count them
> twice, so it might indeed be useful to have this proposed behavior as
> an option.
> Anyway, this is just my opinion (not backed by any hard research or
> experimentation), so it might be wrong.
> -- 
> Nadav Har'El                 | Monday, Oct 13 2008, 14 Tishri 5769
> IBM Haifa Research Lab       |-----------------------------------------
>                              | Windows-2000/Professional isn't.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
