lucene-java-user mailing list archives

From Grant Ingersoll
Subject Re: Confidence scores at search time
Date Mon, 02 Mar 2009 21:22:24 GMT

On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:

> Hi Grant,
> It's true, I may have an X-Y problem here. =)
> My basic need is to sacrifice recall to achieve greater precision.
> Rather than always presenting the user with the top N documents, I
> need to return *only* the documents that seem relevant.  For some
> searches this may be 3 documents, for some it may be none.

Therein lies the rub.  How are you determining what is relevant?  In  
some sense, you are asking Lucene to determine what is relevant and  
then turning around and telling it you are not happy with it doing  
what you told it to do (I'm exaggerating a bit, I know), namely tell  
you what the relevant documents are for a given query and a set of  
documents based on its scoring model.  As an alternate tack, I  
usually look at this type of thing and try to figure out a way to make  
my queries more precise (e.g. replace OR with AND, introduce phrase  
queries, filter or add NOT clauses or some other qualifiers) or some  
other relevance tricks [1], [2].
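
As a rough illustration of the query-tightening idea above, here is a
minimal sketch that builds stricter query strings in the classic Lucene
query parser syntax (+term for required, -term for prohibited, double
quotes for a phrase).  The class and method names are hypothetical, and
the sketch assumes the terms contain no characters that need escaping.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, not part of Lucene: it only assembles query strings
// that could then be handed to a query parser.
public class StrictQueryBuilder {

    // Every term becomes required, which is the "replace OR with AND" step.
    static String requireAll(List<String> terms) {
        StringJoiner joined = new StringJoiner(" ");
        for (String term : terms) {
            joined.add("+" + term);
        }
        return joined.toString();
    }

    // The terms must occur together, in order, as a phrase.
    static String phrase(List<String> terms) {
        return "\"" + String.join(" ", terms) + "\"";
    }

    // Appends prohibited (NOT) clauses to an existing query string.
    static String exclude(String query, List<String> unwanted) {
        StringJoiner joined = new StringJoiner(" ");
        joined.add(query);
        for (String term : unwanted) {
            joined.add("-" + term);
        }
        return joined.toString();
    }
}
```

Each of these trades recall for precision in a different way, so it is
probably worth trying them one at a time against real queries.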

That being said, I could see maybe determining a delta value such that  
if the distance between any two scores is more than the delta, you cut  
off the rest of the docs.  This takes into account the relative state  
of scores and is not some arbitrary value (although, the delta is, of  
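
A minimal sketch of that delta idea, assuming the scores arrive sorted
in descending order (as they would if read from the ScoreDoc entries of
a TopDocs) and reading "any two scores" as "two adjacent scores".  The
class name, method name and delta value are hypothetical.

```java
// Hypothetical helper: finds where to truncate a ranked result list.
public class ScoreGapCutoff {

    // Returns how many leading documents to keep. Scanning from the top,
    // the list is cut at the first place where the drop between two
    // adjacent scores exceeds delta. If no such drop exists, all
    // documents are kept.
    static int cutoffIndex(float[] scoresDescending, float delta) {
        for (int i = 1; i < scoresDescending.length; i++) {
            float gap = scoresDescending[i - 1] - scoresDescending[i];
            if (gap > delta) {
                return i;
            }
        }
        return scoresDescending.length;
    }
}
```

Note that this always keeps at least the top document when there are
any hits, so by itself it cannot produce the "none are relevant" case.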

Since you are allowing the user to "explore", it may be more  
reasonable to cut off at some point, too, but I still don't know of a  
good way to determine what that point is in a generic way.  Maybe with  
some specific knowledge about how you are creating your queries and  
what query terms matched you could come up with something, but still,  
I am uncertain.

The other thing that strikes me is that you could add in some type of  
learning/memory component that tracks your click-through information  
and feeds it back into the system as a relevance signal.

> My user interface in this case isn't the standard "type words in a
> box and we'll show you the best docs" - I'm using Lucene as a tool in
> the background to do some exploration about how I could augment a set
> of traditional results with a few alternative results gleaned from a
> different path.  Not sure if this helps with the X-Y problem, but
> that's my task at hand.


Also, keep in mind there are other techniques for encouraging  
exploration: clustering, faceting, info extraction (identifying named  
entities, etc. and presenting them).

Just throwing out some food for thought.

> Also, while perusing the threads you refer to below, I saw a
> reference to the following link, which seems to have gone dead:

Hmm, bugzilla has moved to JIRA.  I'm not sure where the mapping is  
anymore.   There used to be a Bugzilla Id in JIRA, I think. Sorry.


