lucene-dev mailing list archives

From Marvin Humphrey <>
Subject Re: SweetSpotSimiliarity
Date Wed, 24 May 2006 17:22:06 GMT

On May 23, 2006, at 11:38 PM, Chris Hostetter wrote:

> it has the nice property of giving small increases as the frequency
> increases a small amount, then increasing faster once you reach the point
> where you think small increases are significant, and then grows slower
> again once you are above the point where you think more occurrences are
> actually significant.


"Normalization" and "norms" are funny words to use in this context,  
since you're aggressively manipulating a score multiplier rather than  
normalizing in the usual sense.

> : I tried to graph the hyperbolic function (tip for OS X users: check
> : out, in Utilities).  It looks like by default, everything
> : cancels out and it returns a constant 2.  But it's pretty complicated, so
> : maybe I missed something.
> Hmm... maybe i screwed up the defaults at some point ...

Nah, I found my error -- just a typo that happened somewhere while I  
was swapping in the default values.  I now see something similar to  
what you describe, though the plateaus above and below the transition  
look completely flat.
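
For anyone who wants to reproduce the shape without a graphing tool, here is a minimal sketch of a tanh-style tf curve.  The formula and the four parameter values (min 0, max 2, base 1.3, xoffset 10) are my assumptions for illustration, not something quoted in this thread, and the class name is hypothetical.

```java
public class HyperbolicTfSketch {
    // Assumed parameters for illustration only; not taken from the thread.
    static final double MIN = 0.0;       // lower plateau
    static final double MAX = 2.0;       // upper plateau
    static final double BASE = 1.3;      // controls how sharp the transition is
    static final double X_OFFSET = 10.0; // frequency at the middle of the transition

    // tanh-shaped curve written with an arbitrary base instead of e.
    // Math.pow overflows for very large |freq - X_OFFSET|; this sketch
    // does not guard against that.
    static float tf(float freq) {
        double x = freq - X_OFFSET;
        double up = Math.pow(BASE, x);
        double down = Math.pow(BASE, -x);
        double tanhLike = (up - down) / (up + down); // ranges over (-1, 1)
        return (float) (MIN + (MAX - MIN) / 2.0 * (tanhLike + 1.0));
    }

    public static void main(String[] args) {
        // Print a small table of freq and tf(freq) to eyeball the curve.
        for (int freq = 0; freq <= 20; freq += 2) {
            System.out.println(freq + "\t" + tf(freq));
        }
    }
}
```

With these assumed values the curve sits close to MIN well below the offset, crosses the midpoint at the offset, and approaches MAX well above it, which would explain plateaus that look flat on a graph.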

> Alas ... tf() doesn't take in a field name, to do this, you'd have to
> override the Similarity each time you construct a query object,
> something like this i believe...
>    Query q = new TermQuery(t) {
>       public Similarity getSimilarity(Searcher s) {
>          // wrap whatever Similarity the query would normally use
>          return new SimilarityDelegator(super.getSimilarity(s)) {
>             public float tf(float freq) {
>                ...
>             }
>          };
>       }
>    };
> ...but good lord if that isn't a pain.

Well, let's toss aside backwards-compatibility concerns for the  
purposes of discussion, and see what it would take to make tf()  
change per-Field.

Adding a fieldName argument to would add  
significant overhead, since it gets called a *lot*.

To avoid that, my first thought is that you'd need to supply a  
different Similarity object for each field, by adding a fieldName  
argument to searcher.getSimilarity().  I doubt this would work,  
because Lucene's freq/prox files -- unlike its norms -- are  
consolidated, with terms from multiple fields in one file.  It would  
be hard for the Scorer to know what field it was operating on.
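
To make the per-field idea concrete, here is a rough sketch of resolving a tf function once per field rather than passing a field name on every tf() call.  Everything in it is hypothetical: TfFunction and PerFieldTf are stand-ins I made up, not Lucene classes, and the sketch does nothing about the harder problem of the Scorer knowing which field it is on.

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldTfSketch {
    /** Hypothetical stand-in for the tf() part of a Similarity. */
    interface TfFunction {
        float tf(float freq);
    }

    /** Hypothetical registry that hands out one TfFunction per field name. */
    static class PerFieldTf {
        private final Map<String, TfFunction> byField = new HashMap<>();
        private final TfFunction fallback;

        PerFieldTf(TfFunction fallback) {
            this.fallback = fallback;
        }

        void register(String fieldName, TfFunction fn) {
            byField.put(fieldName, fn);
        }

        // Looked up once when scoring for a field is set up, so the
        // per-posting tf() call itself carries no field-name argument.
        TfFunction forField(String fieldName) {
            return byField.getOrDefault(fieldName, fallback);
        }
    }

    public static void main(String[] args) {
        // Fallback: square root of the frequency.
        PerFieldTf registry = new PerFieldTf(freq -> (float) Math.sqrt(freq));
        // Assumed example field: ignore repetition in "title" entirely.
        registry.register("title", freq -> freq > 0 ? 1.0f : 0.0f);

        TfFunction titleTf = registry.forField("title");
        TfFunction bodyTf = registry.forField("body");
        System.out.println("title: " + titleTf.tf(9f));
        System.out.println("body:  " + bodyTf.tf(9f));
    }
}
```

The point of the design is that the map lookup happens once per field, outside the hot loop, so the cost concern about adding an argument to tf() would not apply.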

My gut is telling me that this is another reason to consolidate freq,  
prox, and norm/boost into a single stream.

Marvin Humphrey
Rectangular Research
