lucene-dev mailing list archives

From Marvin Humphrey
Subject SweetSpotSimiliarity
Date Wed, 24 May 2006 04:55:19 GMT
[note: subject header changed from "Re: [jira] Updated: (LUCENE-577)  

Thought-provoking stuff, Hoss...

On May 23, 2006, at 5:55 PM, Hoss Man (JIRA) wrote:

> This is a new Similarity implementation for the contrib/miscellaneous/
> package. It provides a Similarity designed for people who know the
> "sweetspot" of their data. Three major pieces of functionality are
> included:
> 1) a lengthNorm which creates a "plateau" of values.

Presumably you had this in the can, and didn't just implement it  
today. :)  For those of you who didn't see this afternoon's thread  
"Per-Field Analyzer" on java-user, KinoSearch has used a plateau  
lengthNorm since version 0.06...

    1 / sqrt(max(100, numTerms))

... and it's been a mixed bag.
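
To make the formula concrete, here is a minimal Python sketch that compares it with the plain 1 / sqrt(numTerms) norm. The function names and the `plateau` parameter are made up for illustration; this is not KinoSearch or Lucene code.

```python
import math

def classic_length_norm(num_terms):
    # Plain 1 / sqrt(numTerms): the shorter the field, the bigger the boost.
    return 1.0 / math.sqrt(num_terms)

def plateau_length_norm(num_terms, plateau=100):
    # Fields with up to `plateau` terms all get the same norm;
    # longer fields fall off as 1 / sqrt(numTerms).
    return 1.0 / math.sqrt(max(plateau, num_terms))

for n in (3, 100, 400):
    print(n, classic_length_norm(n), plateau_length_norm(n))
```

With the plateau in place, a 3-term field and a 100-term field receive the same norm, so very short "stub" documents lose their edge.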

The suggestion came my way via Mark Bennett, apparently from Doug
originally, though I didn't see that thread.  There was earlier
discussion on the list, and Mark's nifty graph is still up (linked
from his email).

Making that algo the default achieved my goal: downgrade the type of  
"stub" documents Lucene tends to favor.  However, it also stopped  
excellent matches in fields which are supposed to be short -- like  
title -- from getting a good solid lift.

The only answer seems to be to apply different lengthNorm algos to  
different fields.
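
A rough sketch of what per-field dispatch could look like, in Python. The field names ("title", "body") and the lookup table are assumptions for illustration only, not an existing Lucene or KinoSearch API.

```python
import math

def classic_norm(num_terms):
    # Plain 1 / sqrt(numTerms); guard against empty fields.
    return 1.0 / math.sqrt(max(1, num_terms))

def plateau_norm(num_terms, plateau=100):
    # Flat up to `plateau` terms, then 1 / sqrt(numTerms).
    return 1.0 / math.sqrt(max(plateau, num_terms))

# Hypothetical per-field overrides: short fields keep the classic norm.
FIELD_NORMS = {"title": classic_norm}

def length_norm(field, num_terms):
    # Fields without an override fall back to the plateau norm.
    return FIELD_NORMS.get(field, plateau_norm)(num_terms)

print(length_norm("title", 4))  # short title still gets a solid lift
print(length_norm("body", 4))   # stub body is held at the plateau value
```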

What uses have you found for a plateau lengthNorm, Hoss?

> 2) a baseline tf that provides a fixed value for tf's up to a  
> minimum, at which point it becomes a sqrt curve (this is used by  
> the tf(int) function).
> 3) a hyperbolic tf function which is best explained by graphing the  
> equation.  this isn't used by default, but is available for  
> subclasses to call from their own tf functions.
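
For readers trying to picture item 2, here is one possible shape for a baseline tf, sketched in Python. The parameter names and default values are assumptions, and the offset under the square root is simply one way to make the curve join the flat section without a jump; the patch itself may differ.

```python
import math

def baseline_tf(freq, base=1.5, min_freq=3):
    # Flat section: every frequency up to min_freq scores the same.
    if freq <= min_freq:
        return base
    # sqrt section, shifted so that it equals `base` at freq == min_freq.
    return math.sqrt(freq + base * base - min_freq)

for f in (1, 3, 4, 10):
    print(f, baseline_tf(f))
```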

... and when do you use these custom tf's?

I tried to graph the hyperbolic function (tip for OS X users: check
out the graphing application in Utilities).  It looks like by default,
everything cancels out and it returns a constant 2.  But it's pretty
complicated, so maybe I missed something.

My interest in this is being driven by a really savvy client with a
formal mathematics background and a good feel for search engine
design, though no formal IR training.  Today, he wrote, "The title is
not a discussion.  It's binary; this is being considered or it  
isn't.  The more words that are being considered, the less  
significant any one is, but you can't get more considered by being  
mentioned more than once in the title."

I think I would implement this by having tf always return 1 for the  
title field.
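
A minimal sketch of that idea in Python, assuming a tf function that is told which field it is scoring. That signature and the field names are hypothetical; sqrt(freq) stands in for the usual tf elsewhere.

```python
import math

# Hypothetical set of fields where a term either matches or it doesn't.
BINARY_TF_FIELDS = {"title"}

def tf(field, freq):
    if freq <= 0:
        return 0.0
    if field in BINARY_TF_FIELDS:
        # Repeating a term in the title earns nothing extra.
        return 1.0
    # Everywhere else, fall back to sqrt(freq).
    return math.sqrt(freq)

print(tf("title", 5))  # same as tf("title", 1)
print(tf("body", 4))
```

The other half of the client's point, "the more words that are being considered, the less significant any one is", would still be handled by lengthNorm for the title field.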

Thought: It would be really handy if we had a benchmarking test for  
IR precision.

Marvin Humphrey
Rectangular Research
