lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Sturge <tstu...@metaweb.com>
Subject Re: product based term combination for BooleanQuery?
Date Tue, 03 Jul 2007 23:43:46 GMT
Here's the explain output I currently get for "George Bush" "George W 
Bush", "John Kerry" "John Denver" and "John Bush". (there are others in 
between, but they follow very much the same pattern; an enormous score 
for one of "John" or "Bush" and a very small score for the other being 
better than an average score for both.

As you can see I have a lot of fields, some very important (name, alias, 
title, anchor) and others much less important (text, surround, content, 
body).

I will experiment with DisjunctionMaxQuery, but it honestly seems like 
ProductQuery is what I want at the outer layer with BooleanQuery inside.

Tim

Grant Ingersoll wrote:
> When you do an explain on these results, what are all the factors that 
> contribute to the score?
>
> Could you increase the coord() factor in a custom Similarity 
> implementation, to give a bigger boost to documents that have more 
> matching terms? The point of coord is to give a little bump to those 
> docs that have more terms from the query in a given document. Sounds 
> like you want a bigger bump once you have multiple query terms in a 
> document. Would this work for you?
>
> Also, below...
>
> On Jul 3, 2007, at 3:20 PM, Tim Sturge wrote:
>
>> That's true, but it's not clear that I want phrase matches. Consider 
>> for example:
>>
>> "Lucene Download" as a query. I want something that strongly 
>> references "Lucene" (in the title) and strongly references "Download" 
>> but "Download Lucene" or "Lucene Project Download" are better than 
>> some page that happens to contain the exact phrase.
>
> Not sure I follow you here. By strongly references, do you mean there 
> are multiple occurrences of Download? Why would those alternatives be 
> better than an exact phrase match?
>
>>
>> Other examples are "camera review" or "Gonzales scandal"; there's a 
>> whole class of "subject <modifier>" queries that are not really 
>> phrase based, and my corpus isn't large enough to necessarily contain 
>> the phrase anyway.
>>
>> I agree that many two or three word queries are really best matched 
>> by phrases, but not all. Is it common to use a phrase query with high 
>> slop to overcome the unequal weighting problem?
>>
>> Also, my interface does support "\"John Bush\"" (ie the user can 
>> quote the phrase if they like) and I would prefer not to infer 
>> automatically that they meant to do so.
>>
>> Tim
>>
>> Jason Pump wrote:
>>> You're not using any type of phrase search. Try ->
>>>
>>> ( (title:"John Bush"^4.0) OR (body:"John Bush") ) AND ( 
>>> (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) )
>>>
>>> or maybe
>>>
>>> ( (title:"John Bush"~4^4.0) OR (body:"John Bush"~4) ) AND ( 
>>> (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) )
>>>
>>>
>>>
>>> Tim Sturge wrote:
>>>> I'm following myself up here to ask if anyone has experience or 
>>>> code with a BooleanQuery that weights the terms it encounters on a 
>>>> product basis rather than a sum basis.
>>>>
>>>> This would effectively compute the geometric mean of the term score 
>>>> (rather than the arithmetic mean) and would give me more "middle 
>>>> bias". It also has the great advantage that it automatically 
>>>> implements AND (as something without the term has a score of 0.0 
>>>> which causes the query to go to 0.0 as well.)
>>>>
>>>> I'm curious though why this doesn't already exist. Is it a bad idea 
>>>> in general (that I will discover once I implement it and look at 
>>>> the results?) or does it make searching a lot slower?
>>>>
>>>> Thanks,
>>>>
>>>> Tim
>>>>
>>>> Tim Sturge wrote:
>>>>> I have an index with two different sources of information, one 
>>>>> small but of high quality (call it "title"), and one large, but of 
>>>>> lower quality (call it "body"). I give boosts to certain documents 
>>>>> related to their popularity (this is very similar to what one 
>>>>> would do indexing the web).
>>>>>
>>>>> The problem I have is a query like "John Bush". I translate that 
>>>>> into " (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) 
>>>>> ". But the results I get are:
>>>>>
>>>>> 1. George Bush
>>>>> ...
>>>>> 4. John Kerry
>>>>> ...
>>>>> 10. John Bush
>>>>>
>>>>> The reason is (looking at explain) that George Bush is scored:
>>>>> 169 = sum(
>>>>> 1 = <match in body with tiny norm for "John">
>>>>> )
>>>>> 168 = sum(
>>>>> 160 = <title match for "Bush">
>>>>> 8 = <body match for "Bush">
>>>>> )
>>>>> )
>>>>>
>>>>> and John Kerry is similar but reversed. Poor old "John Bush" only 
>>>>> scores:
>>>>>
>>>>> 72 = sum(
>>>>> 40 = (<title match for "John">+<body match>)
>>>>> 32 = (<title match for "Bush">+ <body match>)
>>>>> )
>>>>>
>>>>> because his initial boost was only 1/4 of George's.
>>>>>
>>>>> The question I have is, how can tell the searcher to care about 
>>>>> "balance"? I really want the score over 2 terms to be more like 
>>>>> (sqrt(X)+sqrt(Y))^2 or maybe even exp(log(X)+log(Y)) rather than 
>>>>> just X+Y. Is that supported in some obvious way, or is there some 
>>>>> other way to phrase my query to say "I want both terms but they 
>>>>> should both be important if possible?"
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------- 
>>>>> -- 
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>
>>>>
>>>> -------------------------------------------------------------------- -
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


Mime
View raw message