lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason Pump <ja...@healthline.com>
Subject Re: product based term combination for BooleanQuery?
Date Tue, 03 Jul 2007 18:57:07 GMT
You're not using any type of phrase search. Try ->

( (title:"John Bush"^4.0) OR (body:"John Bush") ) AND ( (title:John^4.0 
body:John) AND (title:Bush^4.0 body:Bush) )

or maybe

( (title:"John Bush"~4^4.0) OR (body:"John Bush"~4) ) AND ( 
(title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) )



Tim Sturge wrote:
> I'm following myself up here to ask if anyone has experience or code 
> with a BooleanQuery that weights the terms it encounters on a product 
> basis rather than a sum basis.
>
> This would effectively compute the geometric mean of the term score 
> (rather than the arithmetic mean) and would give me more "middle 
> bias". It also has the great advantage that it automatically 
> implements AND (as something without the term has a score of 0.0 which 
> causes the query to go to 0.0 as well.)
>
> I'm curious though why this doesn't already exist. Is it a bad idea in 
> general (that I will discover once I implement it and look at the 
> results?) or does it make searching a lot slower?
>
> Thanks,
>
> Tim
>
> Tim Sturge wrote:
>> I have an index with two different sources of information, one small 
>> but of high quality (call it "title"), and one large, but of lower 
>> quality (call it "body").  I give boosts to certain documents related 
>> to their popularity (this is very similar to what one would do 
>> indexing the web).
>>
>> The problem I have is a query like "John Bush". I translate that into 
>> " (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) ". But 
>> the results I get are:
>>
>> 1. George Bush
>> ...
>> 4. John Kerry
>> ...
>> 10. John Bush
>>
>> The reason is (looking at explain) that George Bush is scored:
>> 169 = sum(
>> 1 =  <match in body with tiny norm for "John">
>> )
>> 168 = sum(
>>     160 = <title match for "Bush">
>>     8 = <body match for "Bush">
>> )
>> )
>>
>> and John Kerry is similar but reversed. Poor old "John Bush" only 
>> scores:
>>
>> 72 = sum(
>>  40 = (<title match for "John">+<body match>)
>>  32 = (<title match for "Bush">+ <body match>)
>> )
>>
>> because his initial boost was only 1/4 of George's.
>>
>> The question I have is, how can tell the searcher to care about 
>> "balance"? I really want the score over 2 terms to be more like 
>> (sqrt(X)+sqrt(Y))^2 or maybe even exp(log(X)+log(Y))  rather than 
>> just X+Y. Is that supported in some obvious way, or is there some 
>> other way to phrase my query to say "I want both terms but they 
>> should both be important if possible?"
>>
>> Thanks,
>>
>> Tim
>>
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message