lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Klaas <mike.kl...@gmail.com>
Subject Re: product based term combination for BooleanQuery?
Date Tue, 03 Jul 2007 17:56:45 GMT
Try out: http://issues.apache.org/jira/browse/LUCENE-850

If this is useful to you, be sure to add a comment to the issue.

-Mike

On 3-Jul-07, at 10:51 AM, Tim Sturge wrote:

> I'm following myself up here to ask if anyone has experience or  
> code with a BooleanQuery that weights the terms it encounters on a  
> product basis rather than a sum basis.
>
> This would effectively compute the geometric mean of the term score  
> (rather than the arithmetic mean) and would give me more "middle  
> bias". It also has the great advantage that it automatically  
> implements AND (as something without the term has a score of 0.0  
> which causes the query to go to 0.0 as well.)
>
> I'm curious though why this doesn't already exist. Is it a bad idea  
> in general (that I will discover once I implement it and look at  
> the results?) or does it make searching a lot slower?
>
> Thanks,
>
> Tim
>
> Tim Sturge wrote:
>> I have an index with two different sources of information, one  
>> small but of high quality (call it "title"), and one large, but of  
>> lower quality (call it "body").  I give boosts to certain  
>> documents related to their popularity (this is very similar to  
>> what one would do indexing the web).
>>
>> The problem I have is a query like "John Bush". I translate that  
>> into " (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush)  
>> ". But the results I get are:
>>
>> 1. George Bush
>> ...
>> 4. John Kerry
>> ...
>> 10. John Bush
>>
>> The reason is (looking at explain) that George Bush is scored:
>> 169 = sum(
>> 1 =  <match in body with tiny norm for "John">
>> )
>> 168 = sum(
>>     160 = <title match for "Bush">
>>     8 = <body match for "Bush">
>> )
>> )
>>
>> and John Kerry is similar but reversed. Poor old "John Bush" only  
>> scores:
>>
>> 72 = sum(
>>  40 = (<title match for "John">+<body match>)
>>  32 = (<title match for "Bush">+ <body match>)
>> )
>>
>> because his initial boost was only 1/4 of George's.
>>
>> The question I have is, how can tell the searcher to care about  
>> "balance"? I really want the score over 2 terms to be more like  
>> (sqrt(X)+sqrt(Y))^2 or maybe even exp(log(X)+log(Y))  rather than  
>> just X+Y. Is that supported in some obvious way, or is there some  
>> other way to phrase my query to say "I want both terms but they  
>> should both be important if possible?"
>>
>> Thanks,
>>
>> Tim
>>
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message