lucene-java-user mailing list archives

From Erick Erickson <>
Subject Re: relevance function for scores
Date Mon, 18 May 2009 13:13:27 GMT
Have you looked at TopDocCollector? Basically, you can tell it to return
only the top N docs by score (N is arbitrary).
What you then have is an array of raw score and doc ID pairs
AND a max score.

NOTE: "raw score" is not normalized, i.e. is not guaranteed to be
between 0 and 1.
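
If you want the scores on a common scale before comparing them, one option is to divide each raw score by the max score. A minimal sketch, assuming you have already copied the raw scores into a plain float[] (the class and method names here are made up for illustration and are not part of Lucene):

```java
/** Hypothetical helper, not a Lucene class. */
public final class ScoreNormalizer {
    private ScoreNormalizer() {}

    /**
     * Divides each raw score by maxScore so the best hit maps to 1.0.
     * Returns all zeros if maxScore is not positive, since there is
     * nothing meaningful to scale by in that case.
     */
    public static float[] normalize(float[] rawScores, float maxScore) {
        float[] out = new float[rawScores.length];
        if (maxScore <= 0f) {
            return out;
        }
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] / maxScore;
        }
        return out;
    }
}
```

This only rescales relative to the best hit of the current query, so the resulting values are still not comparable across different queries.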

So now you can examine the scores and put them in buckets any
way you want. All you're doing is spinning through a small data
structure performing some calculations.
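
As one possible way to do that bucketing, here is a rough sketch that looks for the largest relative drop between adjacent scores and cuts the list there. It assumes the scores are already sorted in descending order in a float[]; the class name, method name, and minRatio parameter are invented for this example, and this is just one heuristic, not an established relevance model:

```java
/** Hypothetical helper, not a Lucene class. */
public final class ScoreCutoff {
    private ScoreCutoff() {}

    /**
     * Returns how many leading docs to keep. Scans adjacent pairs and
     * picks the position with the largest ratio previous/next, provided
     * that ratio exceeds minRatio. If no drop is that steep, every doc
     * is kept (the array length is returned).
     */
    public static int cutoffByLargestDrop(float[] scoresDesc, float minRatio) {
        int keep = scoresDesc.length;
        float bestRatio = minRatio;
        for (int i = 1; i < scoresDesc.length; i++) {
            float upper = scoresDesc[i - 1];
            float lower = scoresDesc[i];
            if (upper <= 0f) {
                break; // everything from here on scores zero or less
            }
            // A drop to zero counts as an infinitely steep drop.
            float ratio = lower > 0f ? upper / lower : Float.POSITIVE_INFINITY;
            if (ratio > bestRatio) {
                bestRatio = ratio;
                keep = i;
            }
        }
        return keep;
    }
}
```

With a boost of 1000 on one field, a minRatio somewhere around 10 to 100 would probably separate the two groups, but you would have to tune that against your own data.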


On Mon, May 18, 2009 at 8:52 AM, Joel Halbert <> wrote:

> Hi,
> I'd like to apply a score filter. I realise that filtering by absolute
> (i.e. anything less than x) scores is pretty meaningless.
> In my case I want to filter based on relative score - or on some
> function of score which looks for clustering of documents around certain
> score values.
> Context: I have set up field boosts such that a query hit on one indexed
> field will, in theory, result in a score one or more orders of magnitude
> greater than a hit on some other field. So if I have 2 fields A and B
> and I'm really really interested in hits on A, and only interested in
> hits on B if there were none on A,  I boost A by 1000, relative to B.
> The resultant score should reflect this.
> The ability to do this becomes important when we want to re-order the
> search results around some other field (not score) and are not
> interested in displaying the least relevant documents.
> It is an easy thing to write a basic 'document collector/result filter'
> that uses relative score information to filter out documents where any
> score is less than some magnitude of the best score, but I'm sure this
> could be more elegantly generalised into some mathematical
> "relevance/significance" model/function  which could determine some
> optimal cutoff for documents based on the clustering of results around
> scores.
> e.g. if my top 5 documents are all between score 0.9 and 0.7 and the
> remaining 10 are less than 0.01 then we could sensibly take the top 5
> docs as most relevant.
> Has anyone experience of doing such a thing?
> Regards,
> Joel