lucene-java-user mailing list archives

From Chris Hostetter
Subject Re: Confidence scores at search time
Date Fri, 06 Mar 2009 00:09:39 GMT

: That being said, I could see maybe determining a delta value such that if the
: distance between any two scores is more than the delta, you cut off the rest
: of the docs.  This takes into account the relative state of scores and is not
: some arbitrary value (although, the delta is, of course)
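A minimal sketch of the delta idea quoted above, assuming "any two scores" means adjacent scores once the hits are sorted in descending order. The function name cut_at_gap and the sample scores are made up for illustration; nothing here is a Lucene API.

```python
def cut_at_gap(scores, delta):
    """Keep the top-ranked scores up to the first gap larger than delta.

    Assumption: the gap is measured between adjacent scores in
    descending order, and everything after the first big gap is dropped.
    """
    ranked = sorted(scores, reverse=True)
    for i in range(1, len(ranked)):
        if ranked[i - 1] - ranked[i] > delta:
            return ranked[:i]  # cut off the rest of the docs
    return ranked  # no gap exceeded delta, keep everything


top = cut_at_gap([0.9, 0.85, 0.8, 0.3, 0.25], delta=0.2)
```

With these sample scores the first gap above 0.2 falls between 0.8 and 0.3, so only the first three scores are kept.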

I read an interesting paper a while back that suggested a similar 
strategy for a related problem... 

...while the whole paper might be interesting to some, the relevant parts 
to this discussion are Section 2.1 and Table 1.  the goal there is to 
identify which reference set(s) are relevant to an input set -- they 
compute a similarity score for each set, sort them, and then compute the 
percentage difference for each successive pair.  they consider any set 
with a score above the average score for all sets *and* with a score 
percentage diff (relative to the next highest scoring set) greater than some 
arbitrary delta to be a match.  (the theory being that an arbitrary 
percentage delta is better than an arbitrary score cutoff, and that you 
only want things scoring better than average, because as scores taper off 
on the lower end, they can taper off quickly and show very high percentage 
differences.)
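A rough sketch of the selection rule as summarized in this message, not the paper's actual code. Assumptions: scores are non-negative, the percentage diff is taken as (score - next_score) / score, and the name matching_sets, the default delta of 0.25, and the sample scores are all hypothetical.

```python
def matching_sets(scores, delta=0.25):
    """Return (name, score) pairs that pass both tests described above.

    scores: dict mapping a set name to its similarity score.
    delta:  minimum relative drop to the next-lower score (hypothetical
            default; the rule only calls it "some arbitrary delta").
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    average = sum(score for _, score in ranked) / len(ranked)

    matches = []
    # Walk successive pairs: each set is compared to the one ranked below it.
    for (name, score), (_, next_score) in zip(ranked, ranked[1:]):
        if score <= average:
            break  # everything further down is at or below average too
        pct_diff = (score - next_score) / score
        if pct_diff > delta:
            matches.append((name, score))
    return matches


result = matching_sets({"a": 10.0, "b": 9.5, "c": 4.0, "d": 3.8})
```

Read literally, only "b" qualifies here: "a" is above average but sits just 5% above "b", while "c" and "d" fall below the average of 6.825. Whether sets ranked above a large gap should also count as matches is not settled by the summary above.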

I have no idea how well this approach would work for general search (with 
a large set of documents and a large number of matches)

To keep in mind just how diverse the approaches to this type of problem 
can be depending on the nitty gritty specifics of your use case, consider 
the "GuardianComponent" example from my BTB talk at apachecon last year 
(slides 32-25)...

...either of the approaches mentioned there tackles the "sacrifice recall to 
achieve greater precision" aspect of your problem in the specific domain 
of short documents where you want to eliminate matches that are 
significantly longer than the input (even if they score well using 
traditional tf/idf metrics)

