lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Clarity: Is there a Query boosting 50-50 over 1000-1 ?
Date Fri, 29 Aug 2008 14:40:21 GMT

On Aug 29, 2008, at 7:53 AM, Sébastien Rainville wrote:

> I'm curious... what do you mean by "It's not perfect (there is no such
> thing) but it works pretty well in most cases, and works great if you
> spend a little time figuring out the right length normalization
> factors." ? Can you plz elaborate a little more on what are the length
> normalization factors exactly and what makes them good or bad... it's a
> part of lucene that is really confusing me as I'm still a newbie :P
>


If you're a newbie, it's probably best not to go there just yet, but,
since you asked...

Lucene and many search systems adjust scores based on how long  
documents are, the theory being that a shorter document w/ the  
relevant terms is more (as?) interesting than (as?) a longer document  
with the term repeated a ton of times.  It essentially acts as a  
counterweight to long documents with high term frequency values.  But,  
like pretty much everything in relevance tuning, as Erik says, "it  
depends".  It depends on things like your queries, your docs, etc.   
You (and by you, I mean your users) may actually prefer longer  
documents, or you may find that Lucene favors short documents too  
much.  Thus, one may want to override lengthNorm() in the
Similarity class.  The key, of course, is to tread into this only after
you have a working system and have established that you are, indeed,
not happy with a _large_ number of results.  At that point you need to
do a methodical study of what the queries are and what the "right"
results are, and then explore alternatives (even doing things like A/B
testing), of which modifying length normalization may be one.
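
To make the "counterweight" idea concrete, here is a minimal sketch.  It
is not Lucene code: it assumes a term-frequency factor of sqrt(freq) and
a length norm of 1/sqrt(numTerms), and it ignores idf, boosts, and the
lossy encoding Lucene applies to stored norms.  The class and method
names are hypothetical.

```java
public class CounterweightSketch {

    // Assumed term-frequency factor: grows with the square root of the raw count.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Assumed length norm: shrinks as the field gets longer.
    static double lengthNorm(int numTerms) {
        if (numTerms <= 0) {
            throw new IllegalArgumentException("numTerms must be positive");
        }
        return 1.0 / Math.sqrt(numTerms);
    }

    // Only the part of a score affected by frequency and length.
    static double partialScore(int freq, int numTerms) {
        return tf(freq) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // A short document with two matches vs. a long one with twenty.
        double shortDoc = partialScore(2, 50);
        double longDoc = partialScore(20, 5000);
        System.out.printf("short doc: %.4f%n", shortDoc);
        System.out.printf("long doc:  %.4f%n", longDoc);
    }
}
```

Under these assumptions the short document comes out ahead even though
the long one repeats the term ten times as often, which is the behavior
described above.  Whether that is what your users want is exactly the
"it depends" part.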

At a lower level, some people feel that a lengthNorm() of
1/sqrt(numTerms) is not the right default, but I don't know that anyone
has definitively said what a better default is.  It works pretty well  
for most people out of the box, which is why I made the comment about  
it probably not being best to go there just yet.  My gut says it is a
value that Doug came up with way back when he was doing a lot of
empirical testing and felt it was best, and it really hasn't been
modified since.  That is just a guess on my part, though; I haven't
looked at the revision history.
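
For a feel of what changing the default would do, the sketch below
compares 1/sqrt(numTerms) with a flatter curve.  The exponent 0.25 is
purely illustrative (nobody in this thread proposes it as a better
default), and the class and method names are hypothetical.  In a real
setup the formula would go into the lengthNorm() override of a
Similarity subclass, and the index would need rebuilding for it to take
effect, since norms are computed at index time.

```java
public class LengthNormSketch {

    // The default discussed above: 1/sqrt(numTerms).
    static double defaultLengthNorm(int numTerms) {
        return exponentLengthNorm(numTerms, 0.5);
    }

    // Hypothetical alternative: 1/numTerms^exponent.
    // An exponent below 0.5 penalizes long documents less.
    static double exponentLengthNorm(int numTerms, double exponent) {
        if (numTerms <= 0) {
            throw new IllegalArgumentException("numTerms must be positive");
        }
        return 1.0 / Math.pow(numTerms, exponent);
    }

    public static void main(String[] args) {
        int[] lengths = {10, 100, 1000, 10000};
        for (int n : lengths) {
            System.out.printf("%6d terms: default=%.4f  flatter=%.4f%n",
                    n, defaultLengthNorm(n), exponentLengthNorm(n, 0.25));
        }
    }
}
```

Printing the two columns side by side for your own typical document
lengths is a cheap first check before committing to a reindex and a
proper relevance study.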

You may find Doron's wiki entry informative: http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team

You also might find my talk at ApacheCon 07 helpful in general,
starting at slide 23 or so where I talk about relevance:
http://people.apache.org/~gsingers/apachecon07/LucenePerformance.ppt

Otherwise, dig into the archives at lucene.markmail.org and look up  
length normalization or relevance tuning, Similarity, etc.

HTH,
Grant

