lucene-java-user mailing list archives

From: Joel Halbert
Subject: Re: relevance function for scores
Date: Mon, 18 May 2009 13:25:42 GMT

Hi Erick,

Thanks for the pointer. Sorry if the question was a bit unclear but
basically I'm looking to see if anyone has any pointers on the actual
mathematical functions or models to use (rather than the
implementation). I'd be really interested to hear what others have used
to solve this - since ideally I'd like a cutoff point optimised to the
resultant score values.
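
One simple model for this kind of cutoff is to look for the largest multiplicative gap between consecutive scores and cut there. The sketch below is only an illustration of that idea, not something proposed in this thread or provided by Lucene: the class name `ScoreGapCutoff`, the method `keepCount`, and the `minRatio` tuning parameter are all hypothetical, and it assumes the scores are already sorted highest first and are strictly positive.

```java
/** Hypothetical helper (not part of Lucene): cut the result list at its largest score gap. */
final class ScoreGapCutoff {

    /**
     * Returns how many of the top-ranked documents to keep.
     *
     * @param scoresDescending scores sorted highest first, all strictly positive (assumed)
     * @param minRatio         assumed tuning knob: a drop only counts as a cluster boundary
     *                         if score[i] / score[i + 1] is at least this large
     *                         (10 means one order of magnitude)
     */
    static int keepCount(float[] scoresDescending, double minRatio) {
        int n = scoresDescending.length;
        int cut = n;            // default: no qualifying gap, keep everything
        double bestRatio = 0.0; // largest qualifying ratio seen so far

        for (int i = 0; i + 1 < n; i++) {
            double hi = scoresDescending[i];
            double lo = scoresDescending[i + 1];
            if (hi <= 0.0 || lo <= 0.0) {
                throw new IllegalArgumentException("scores must be strictly positive");
            }
            double ratio = hi / lo;
            // Keep the largest gap; on ties the earliest gap wins.
            if (ratio >= minRatio && ratio > bestRatio) {
                bestRatio = ratio;
                cut = i + 1;
            }
        }
        return cut;
    }
}
```

With five scores between 0.9 and 0.7 followed by ten scores below 0.01, the largest ratio sits between the fifth and sixth score, so `keepCount(scores, 10.0)` keeps the top five. Comparing ratios rather than differences is a design choice that suits boosts separated by orders of magnitude; a difference-based gap would be an equally valid variant.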


-----Original Message-----
From: Erick Erickson <>
Subject: Re: relevance function for scores
Date: Mon, 18 May 2009 09:13:27 -0400

Have you looked at TopDocCollector? Basically, you can tell it to only return
you the top N docs by score (N is arbitrary).
What you then have is an array of raw score and doc ID pairs
AND a max score.

NOTE: "raw score" is not normalized, i.e. is not guaranteed to be
between 0 and 1.

So now you can examine the scores and put them in buckets any
way you want, all you're doing is spinning through a small data
structure performing some calculations.....
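
As a rough illustration of that "spin through the structure" step, the sketch below keeps only the documents whose raw score is at least some fraction of the max score. It deliberately uses plain arrays as a stand-in for the score and doc ID pairs, so no Lucene classes are involved; the class `RelativeScoreFilter`, the method `keep`, and the `minFraction` parameter are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene): filter hits by score relative to the best hit. */
final class RelativeScoreFilter {

    /**
     * Returns the doc IDs whose raw score is at least minFraction of maxScore.
     *
     * @param docIds      doc IDs of the top N hits
     * @param scores      raw (unnormalized) scores, parallel to docIds
     * @param maxScore    the highest raw score among the hits
     * @param minFraction assumed threshold, e.g. 0.01f keeps hits within two
     *                    orders of magnitude of the best one
     */
    static List<Integer> keep(int[] docIds, float[] scores, float maxScore, float minFraction) {
        if (docIds.length != scores.length) {
            throw new IllegalArgumentException("docIds and scores must have the same length");
        }
        List<Integer> kept = new ArrayList<>();
        if (maxScore <= 0f) {
            return kept; // nothing meaningful to compare against
        }
        for (int i = 0; i < scores.length; i++) {
            // Dividing by maxScore maps raw scores into (0, 1] for this result set only.
            float relative = scores[i] / maxScore;
            if (relative >= minFraction) {
                kept.add(docIds[i]);
            }
        }
        return kept;
    }
}
```

For example, with raw scores 1200, 900 and 1.5, a max score of 1200 and `minFraction` of 0.01f, the first two documents are kept and the third is dropped. The relative scores are only comparable within one result set, not across queries.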


On Mon, May 18, 2009 at 8:52 AM, Joel Halbert <> wrote:

> Hi,
> I'd like to apply a score filter. I realise that filtering by absolute
> (i.e. anything less than x) scores is pretty meaningless.
> In my case I want to filter based on relative score - or on some
> function of score which looks for clustering of documents around certain
> score values.
> Context: I have set up field boosts such that a query hit on one indexed
> field will, in theory, result in a score one or more orders of magnitude
> greater than a hit on some other field. So if I have 2 fields A and B
> and I'm really really interested in hits on A, and only interested in
> hits on B if there were none on A,  I boost A by 1000, relative to B.
> The resultant score should reflect this.
> The ability to do this becomes important when we want to re-order the
> search results around some other field (not score) and are not
> interested in displaying the least relevant documents.
> It is an easy thing to write a basic 'document collector/result filter'
> that uses relative score information to filter out documents where any
> score is less than some magnitude of the best score, but I'm sure this
> could be more elegantly generalised into some mathematical
> "relevance/significance" model/function  which could determine some
> optimal cutoff for documents based on the clustering of results around
> scores.
> e.g. if my top 5 documents are all between score 0.9 and 0.7 and the
> remaining 10 are less than 0.01 then we could sensibly take the top 5
> docs as most relevant.
> Has anyone experience of doing such a thing?
> Regards,
> Joel
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
