Whoops, got that backwards. It should read:
> if (score[n] / score[n-1]) < c / (boost_factor)
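
In case it helps, here is a rough Java sketch of that check. The names
(ScoreCutoff, findCutoff) are made up for illustration and none of this is
Lucene API; it assumes you have already pulled the scores out of your hits
into a float[] sorted in descending order.

```java
class ScoreCutoff {

    /**
     * Returns how many leading hits to keep.
     *
     * Assumes scores are sorted in descending order (this is checked).
     * The cutoff is placed just before the first hit n for which
     * score[n] / score[n-1] < c / boostFactor.
     * If no such drop is found, every hit is kept.
     */
    static int findCutoff(float[] scores, double c, double boostFactor) {
        double threshold = c / boostFactor;
        for (int n = 1; n < scores.length; n++) {
            if (scores[n] > scores[n - 1]) {
                throw new IllegalArgumentException(
                        "scores are not sorted in descending order at index " + n);
            }
            // Written as a product rather than a ratio so that a zero
            // score never causes a division by zero.
            if (scores[n] < threshold * scores[n - 1]) {
                return n; // hits 0 .. n-1 are kept
            }
        }
        return scores.length;
    }
}
```

For example, with scores {0.9, 0.85, 0.8, 0.75, 0.7, 0.009, 0.008, 0.005},
findCutoff(scores, 20, 1000) keeps the first 5 hits. With c = 5 it finds no
cutoff and keeps all 8, because the biggest drop there is only about 78x,
which is why c needs some experimenting.
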
On Mon, May 25, 2009 at 4:10 PM, Babak Farhang <farhang@gmail.com> wrote:
> How about determining the cutoff by measuring the percentage
> difference between successive scores: if the score drops by a
> threshold amount then you've hit the cutoff. In the example you
> mention, you might want to try something like c/1000, where 1 < c < 25
> is a constant (experiment to find a sweet spot for c).
>
> I.e. something like
>
> if (score[n-1] / score[n]) < c / (boost_factor),
>
> then you've reached your cutoff at the (n-1)th hit
> (where boost_factor=1000 in your example).
>
> One thing to check is that the scores are indeed sorted in descending
> order to begin with. For example, I don't think the hits in
> TopDocCollector and its brethren are strictly ordered this way (no?).
>
> Babak
>
> On Mon, May 18, 2009 at 6:52 AM, Joel Halbert <joel@su3analytics.com> wrote:
>> Hi,
>>
>> I'd like to apply a score filter. I realise that filtering by absolute
>> (i.e. anything less than x) scores is pretty meaningless.
>>
>> In my case I want to filter based on relative score, or on some
>> function of score which looks for clustering of documents around certain
>> score values.
>>
>> Context: I have set up field boosts such that a query hit on one indexed
>> field will, in theory, result in a score one or more orders of magnitude
>> greater than a hit on some other field. So if I have 2 fields A and B
>> and I'm really really interested in hits on A, and only interested in
>> hits on B if there were none on A, I boost A by 1000, relative to B.
>> The resultant score should reflect this.
>>
>> The ability to do this becomes important when we want to reorder the
>> search results around some other field (not score) and are not
>> interested in displaying the least relevant documents.
>>
>>
>> It is an easy thing to write a basic 'document collector/result filter'
>> that uses relative score information to filter out documents where any
>> score is less than some magnitude of the best score, but I'm sure this
>> could be more elegantly generalised into some mathematical
>> "relevance/significance" model/function which could determine some
>> optimal cutoff for documents based on the clustering of results around
>> scores.
>> e.g. if my top 5 documents are all between score 0.9 and 0.7 and the
>> remaining 10 are less than 0.01 then we could sensibly take the top 5
>> docs as most relevant.
>>
>> Has anyone experience of doing such a thing?
>>
>>
>> Regards,
>> Joel

