lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: product based term combination for BooleanQuery?
Date Tue, 03 Jul 2007 20:27:25 GMT
When you do an explain on these results, what are all the factors  
that contribute to the score?

Could you increase the coord() factor in a custom Similarity  
implementation, to give a bigger boost to documents that have more  
matching terms?  The point of coord is to give a little bump to those  
docs that have more terms from the query in a given document.  Sounds  
like you want a bigger bump once you have multiple query terms in a  
document.  Would this work for you?

Also, below...

On Jul 3, 2007, at 3:20 PM, Tim Sturge wrote:

> That's true, but it's not clear that I want phrase matches.  
> Consider for example:
>
> "Lucene Download" as a query. I want something that strongly  
> references "Lucene" (in the title) and strongly references  
> "Download" but "Download Lucene" or "Lucene Project Download" are  
> better than some page that happens to contain the exact phrase.

Not sure I follow you here.  By strongly references, do you mean  
there are multiple occurrences of Download?  Why would those  
alternatives be better than an exact phrase match?

>
> Other examples are "camera review" or "Gonzales scandal"; there's a  
> whole class of "subject <modifier>" queries that are not really  
> phrase based, and my corpus isn't large enough to necessarily  
> contain the phrase anyway.
>
> I agree that many two or three word queries are really best matched  
> by phrases, but not all. Is it common to use a phrase query with  
> high slop to overcome the unequal weighting problem?
>
> Also, my interface does support "\"John Bush\"" (ie the user can  
> quote the phrase if they like) and I would prefer not to infer  
> automatically that they meant to do so.
>
> Tim
>
> Jason Pump wrote:
>> You're not using any type of phrase search. Try ->
>>
>> ( (title:"John Bush"^4.0) OR (body:"John Bush") ) AND  
>> ( (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) )
>>
>> or maybe
>>
>> ( (title:"John Bush"~4^4.0) OR (body:"John Bush"~4) ) AND  
>> ( (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) )
>>
>>
>>
>> Tim Sturge wrote:
>>> I'm following myself up here to ask if anyone has experience or  
>>> code with a BooleanQuery that weights the terms it encounters on  
>>> a product basis rather than a sum basis.
>>>
>>> This would effectively compute the geometric mean of the term  
>>> score (rather than the arithmetic mean) and would give me more  
>>> "middle bias". It also has the great advantage that it  
>>> automatically implements AND (as something without the term has a  
>>> score of 0.0 which causes the query to go to 0.0 as well.)
>>>
>>> I'm curious though why this doesn't already exist. Is it a bad  
>>> idea in general (that I will discover once I implement it and  
>>> look at the results?) or does it make searching a lot slower?
>>>
>>> Thanks,
>>>
>>> Tim
>>>
>>> Tim Sturge wrote:
>>>> I have an index with two different sources of information, one  
>>>> small but of high quality (call it "title"), and one large, but  
>>>> of lower quality (call it "body").  I give boosts to certain  
>>>> documents related to their popularity (this is very similar to  
>>>> what one would do indexing the web).
>>>>
>>>> The problem I have is a query like "John Bush". I translate that  
>>>> into " (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush)  
>>>> ". But the results I get are:
>>>>
>>>> 1. George Bush
>>>> ...
>>>> 4. John Kerry
>>>> ...
>>>> 10. John Bush
>>>>
>>>> The reason is (looking at explain) that George Bush is scored:
>>>> 169 = sum(
>>>> 1 =  <match in body with tiny norm for "John">
>>>> )
>>>> 168 = sum(
>>>>     160 = <title match for "Bush">
>>>>     8 = <body match for "Bush">
>>>> )
>>>> )
>>>>
>>>> and John Kerry is similar but reversed. Poor old "John Bush"  
>>>> only scores:
>>>>
>>>> 72 = sum(
>>>>  40 = (<title match for "John">+<body match>)
>>>>  32 = (<title match for "Bush">+ <body match>)
>>>> )
>>>>
>>>> because his initial boost was only 1/4 of George's.
>>>>
>>>> The question I have is, how can tell the searcher to care about  
>>>> "balance"? I really want the score over 2 terms to be more like  
>>>> (sqrt(X)+sqrt(Y))^2 or maybe even exp(log(X)+log(Y))  rather  
>>>> than just X+Y. Is that supported in some obvious way, or is  
>>>> there some other way to phrase my query to say "I want both  
>>>> terms but they should both be important if possible?"
>>>>
>>>> Thanks,
>>>>
>>>> Tim
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------- 
>>>> --
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message