lucene-java-user mailing list archives

From Grant Ingersoll <>
Subject Re: IDF scoring issue
Date Wed, 17 Dec 2008 15:31:28 GMT

On Dec 17, 2008, at 9:26 AM, Rajiv2 wrote:

> Because the search term is provided by a user, and that user would
> explicitly have to put quotes around "marietta ga", when I believe the
> search text as it is (fleming roofing inc., marietta ga) should score
> higher for "marietta ga".

Just because the user doesn't do it doesn't mean you can't.  You're
stating that there is an implied ordering in their query, yet you
don't want to take advantage of that.  You can often achieve better
results by generating phrase queries implicitly based on 2 or 3  
grams.  You might also even try generating the whole thing as a phrase  
query with a really large slop value (like 100 or more).  Thus,  
scoring will reward things when they are closer together, but you  
still get the flexibility of an AND-like query.  Downside is,  
possibly, a small performance hit, but you could test it first.  Or,
you could add the phrase queries as optional OR clauses to the
original query, something like: fleming OR roofing OR marietta OR ga
OR ("fleming roofing" OR "roofing marietta" OR "marietta ga").
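
As a rough sketch of both ideas (the bigram phrases and the single
sloppy phrase), something like the following could build the query
strings before handing them to the query parser.  The class and method
names are made up for illustration, and the tokenizer is a crude
stand-in for whatever Analyzer you actually index with.  It only
produces classic query parser syntax ("..." for a phrase, ~N for
slop); it does not call Lucene itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: rewrites raw user input into
// query-parser strings that reward terms appearing next to each other.
class ImplicitPhraseQuery {

    // Lowercase, split on whitespace, strip punctuation.
    // Assumption: a stand-in for the real Analyzer used at index time.
    static List<String> tokenize(String userInput) {
        List<String> tokens = new ArrayList<>();
        for (String raw : userInput.toLowerCase(Locale.ROOT).split("\\s+")) {
            String t = raw.replaceAll("[^a-z0-9]", "");
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // All terms ORed together, plus one optional quoted phrase per 2-gram.
    static String withBigramPhrases(List<String> tokens) {
        StringBuilder sb = new StringBuilder(String.join(" OR ", tokens));
        if (tokens.size() > 1) {
            List<String> phrases = new ArrayList<>();
            for (int i = 0; i + 1 < tokens.size(); i++) {
                phrases.add("\"" + tokens.get(i) + " " + tokens.get(i + 1) + "\"");
            }
            sb.append(" OR (").append(String.join(" OR ", phrases)).append(")");
        }
        return sb.toString();
    }

    // The whole input as one phrase with a large slop value.
    static String asSloppyPhrase(List<String> tokens, int slop) {
        return "\"" + String.join(" ", tokens) + "\"~" + slop;
    }
}
```

Note that the tokenizer keeps "inc" from the original input; whether to
drop terms like that is a separate, domain-specific decision.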

You could also try using a more intelligent Query Parser that is tuned
to your domain, or factoring click-through stats into your results.
Probably not the answer you want to hear, but it is doable and useful.

Do you have any a priori knowledge about Marietta GA over Fleming, GA  
to begin with?  Have you done any broader scale relevance assessment?   
A common problem is that "fixing" one query ends up breaking a
whole bunch of others.  What I typically recommend is that you take
the top 50 queries plus 10-30 random queries from your logs and do an  
assessment of the top 5/10 results for: relevant, somewhat relevant,  
not relevant and embarrassing.  The goal is to maximize relevant while  
minimizing embarrassing and not relevant.
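
To make that assessment something you can compare before and after a
change, a small tally along these lines might be enough.  It is only a
sketch: the class name, the enum and the idea of reporting a per-grade
share are my own assumptions.  The four grades are the ones listed
above, and the judgments themselves still have to come from a person
looking at the top 5/10 results.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical bookkeeping for a manual relevance assessment.
class RelevanceTally {

    // The four grades used when judging each top-N result.
    enum Grade { RELEVANT, SOMEWHAT_RELEVANT, NOT_RELEVANT, EMBARRASSING }

    // Count how many judged results fall into each grade.
    static Map<Grade, Integer> tally(List<Grade> judgments) {
        Map<Grade, Integer> counts = new EnumMap<>(Grade.class);
        for (Grade g : Grade.values()) {
            counts.put(g, 0);
        }
        for (Grade g : judgments) {
            counts.merge(g, 1, Integer::sum);
        }
        return counts;
    }

    // Fraction of all judged results that received the given grade.
    static double share(Map<Grade, Integer> counts, Grade grade) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total == 0 ? 0.0 : (double) counts.get(grade) / total;
    }
}
```

Run it once on the current ranking and once after a scoring change; if
the RELEVANT share goes up on your pet query but the EMBARRASSING share
goes up across the rest of the sample, the change probably isn't worth
it.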

Is this particular example an isolated case or do you feel this is  
systemic to your application?  I've said it before, but it bears  
repeating:  Just because someone typed search terms into your search  
box does not mean you have to actually do a search in order to present  
them results.  If you KNOW the Marietta result is a better result for  
this query, then make it the top result.  Solr has this feature via  
the "QueryElevationComponent" (horrible name, I know), but I call it  
Editorial Placement.  It's not that hard to implement.
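
A bare-bones version of Editorial Placement could look roughly like
this.  To be clear, this is not how Solr's QueryElevationComponent is
implemented; it is a hypothetical sketch in plain Java, with made-up
names and document ids, of the idea of pinning a known-good result
above whatever the search returns.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical editorial placement: a lookup from normalized query text
// to the document id an editor wants shown first.
class EditorialPlacement {

    private final Map<String, String> pinned = new HashMap<>();

    // Lowercase and collapse punctuation/whitespace so that trivially
    // different spellings of the same query share one entry.
    static String normalize(String query) {
        return query.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
    }

    // Register the document that should be the top result for a query.
    void pin(String query, String docId) {
        pinned.put(normalize(query), docId);
    }

    // Put the pinned document first (if there is one) and keep the
    // search engine's order for everything else.
    List<String> apply(String query, List<String> rankedDocIds) {
        String docId = pinned.get(normalize(query));
        if (docId == null) {
            return rankedDocIds;
        }
        List<String> out = new ArrayList<>(rankedDocIds.size() + 1);
        out.add(docId);
        for (String id : rankedDocIds) {
            if (!id.equals(docId)) {
                out.add(id);
            }
        }
        return out;
    }
}
```

Here rankedDocIds stands for whatever ids you pull out of the hits
Lucene gives back, in score order.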

Finally, I'd say I wouldn't split hairs over position too much, if the  
Marietta result is #2 and the Fleming result is #1.  Now, if you're  
telling me the Marietta result is something like #100 and Fleming is  
#1, that's a different story.  The fact is, b/c your user didn't put  
quotes, you don't actually know for a fact that the Fleming result is  
what they wanted (but I agree, it is highly likely).  The point is, I  
wouldn't quibble over anything that is in the top ten.  Lucene is
doing what you told it to do, that is, rank the results according to
TF/IDF, etc.  If you have other pertinent information about Marietta or
the query then you should tell Lucene that via phrases, boosts or  
payloads or altering the Similarity.  But, like I said, be careful  
that you aren't breaking other queries.

