lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joel Halbert <>
Subject Re: relevance function for scores
Date Tue, 26 May 2009 11:15:30 GMT
Yes, something like this might work, although rather than having a
cutoff determined by the difference between two successive document
scores (Doc(n) and Doc(n-1)) I was thinking of using a function which
looked at the distribution of the scores of all matching documents.
Since I just want to exclude outliers it might be a simple case of
dropping those which have a score of less than -x standard deviations or
more. The more the graph was positively skewed the less confidence we
would have in those documents in the tail.

Since such a function would be a function of all documents the ordering of docs in a collector
would not be relevant. 
The main purpose for applying such a filter was the need to allow users to pivot the search
results by some field other than the natural ordering by score.  
In this case we only want to show the most relevant results.

-----Original Message-----
From: Babak Farhang <>
Subject: Re: relevance function for scores
Date: Mon, 25 May 2009 16:11:32 -0600

Woops. Got that backwards.. should read

> if (score[n]  / score[n-1])  < c / (boost_factor)

On Mon, May 25, 2009 at 4:10 PM, Babak Farhang <> wrote:
> How about determining the cutoff by measuring the percentage
> difference between successive scores: if the score drops by a
> threshold amount then you've hit the cutoff.  In the example you
> mention, you might want to try something like c/1000, where 1 < c < 25
> is a constant (experiment to find a sweet spot for c).
> I.e. something like
> if (score[n-1]  / score[n)  < c / (boost_factor) ,
> then you've reached your cutoff at the n-1th hit
> (where boost_factor=1000 in your example).
> One thing to check is that the scores are indeed sorted in descending
> order to begin with.  For example, I don't think the hits in
> TopDocCollector and its brethren are strictly ordered this way (no?).
> -Babak
> On Mon, May 18, 2009 at 6:52 AM, Joel Halbert <> wrote:
>> Hi,
>> I'd like to apply a score filter. I realise that filtering by absolute
>> (i.e. anything less than x) scores is pretty meaningless.
>> In my case I want to filter based on relative score - or on some
>> function of score which looks for clustering of documents around certain
>> score values.
>> Context: I have set up field boosts such that a query hit on one indexed
>> field will, in theory, result in a score one or more order of magnitudes
>> greater than a hit on some other field. So if I have 2 fields A and B
>> and I'm really really interested in hits on A, and only interested in
>> hits on B if there were none on A,  I boost A by 1000, relative to B.
>> The resultant score should reflect this.
>> The ability to do this becomes important when we want to re-order the
>> search results around some other field (not score) and are not
>> interested in displaying the least relevant documents.
>> It is an easy thing to write a basic 'document collector/result filter'
>> that uses relative score information to filter out documents where any
>> score is less than some magnitude of the best score, but I'm sure this
>> could be more elegantly generalised into some mathematical
>> "relevance/significance" model/function  which could determine some
>> optimal cutoff for documents based on the clustering of results around
>> scores.
>> e.g. if my top 5 documents are all between score 0.9 and 0.7 and the
>> remaining 10 are less than 0.01 then we could sensibly take the top 5
>> docs as most relevant.
>> Has anyone experience of doing such a thing?
>> Regards,
>> Joel
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message