lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Contribution: better multi-field searching
Date Wed, 13 Oct 2004 19:56:27 GMT
Chuck Williams wrote:
> That approach does not work.  I could not find an approach that would
> work with the built-in classes, although of course there might be one.
> The problem has two components:  coord, and the fact that BooleanQuery
> instances sum their clause scores to compute the final score.  The
> latter is not easily overridden.  Specifically,
> 
>   title:(albino elephant)^4 description:(albino elephant)
> 
> still has the problem that a result with albino in the title and albino
> in the description gets the same score as a result with albino in the
> title and elephant in the description.
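
A minimal sketch of the tie described in the quoted text, under loud
assumptions: every matching term scores 1.0 (idf, tf, and norms are
ignored), coord is overlap/max at both levels, and the class and method
names are hypothetical rather than Lucene API.

```java
final class FlatSumSketch {
    static final String[] QUERY_TERMS = {"albino", "elephant"};

    // True if the whitespace-tokenized field text contains the term.
    static boolean hasTerm(String fieldText, String term) {
        return java.util.Arrays.asList(fieldText.split("\\s+")).contains(term);
    }

    // Toy coord factor: fraction of a query's clauses that matched.
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // Toy score of field:(albino elephant)^boost.
    // Assumption: each matching term contributes exactly 1.0.
    static double fieldClause(String fieldText, double boost) {
        int overlap = 0;
        for (String term : QUERY_TERMS) {
            if (hasTerm(fieldText, term)) overlap++;
        }
        return boost * overlap * coord(overlap, QUERY_TERMS.length);
    }

    // title:(albino elephant)^4 description:(albino elephant)
    // The outer query sums its two clauses and applies its own coord.
    static double score(String title, String description) {
        double t = fieldClause(title, 4.0);
        double d = fieldClause(description, 1.0);
        int overlap = (t > 0 ? 1 : 0) + (d > 0 ? 1 : 0);
        return (t + d) * coord(overlap, 2);
    }

    public static void main(String[] args) {
        // Both lines print the same value in this toy model.
        System.out.println(score("albino", "albino"));
        System.out.println(score("albino", "elephant"));
    }
}
```

In this model the two documents tie because each field clause sees one
of its two terms, and nothing records which term it was.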

Perhaps I misunderstood what you want.  If you want a reward for albino
and elephant both occurring in the document, regardless of field, then
what you'd want is:

(title:albino description:albino) (title:elephant description:elephant)

with coord disabled on the *inner* queries, no?  This way coord would 
explicitly boost documents which matched on both terms.
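
A rough sketch of how this regrouped query could behave, assuming each
field match scores 1.0 (no idf, tf, norms, or boosts), the inner
queries sum their clauses with coord disabled, and the outer query
applies coord = overlap/max. The names are hypothetical, not Lucene API.

```java
final class PerTermGroupingSketch {
    // True if the whitespace-tokenized field text contains the term.
    static boolean hasTerm(String fieldText, String term) {
        return java.util.Arrays.asList(fieldText.split("\\s+")).contains(term);
    }

    // (title:term description:term) with coord disabled: a plain sum.
    // Assumption: each field match contributes exactly 1.0.
    static double termGroup(String term, String title, String description) {
        return (hasTerm(title, term) ? 1.0 : 0.0)
             + (hasTerm(description, term) ? 1.0 : 0.0);
    }

    // (title:albino description:albino) (title:elephant description:elephant)
    // Outer coord counts how many of the two term groups matched.
    static double score(String title, String description) {
        double albino = termGroup("albino", title, description);
        double elephant = termGroup("elephant", title, description);
        int overlap = (albino > 0 ? 1 : 0) + (elephant > 0 ? 1 : 0);
        return (albino + elephant) * overlap / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(score("albino", "albino"));   // albino only
        System.out.println(score("albino", "elephant")); // both terms
    }
}
```

Here the document matching both terms scores higher, because the outer
coord now counts distinct terms rather than distinct fields.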

> FYI, MaxDisjunctionQuery has made an enormous improvement in the quality
> of my query results, and I have strong reason to believe the same would
> be true in most other domains (more on that coming in the idf^2
> discussion).  In terms of the albino elephant example, the query above
> was putting all the albino animals except elephants above the albino
> elephants, while the query with an outer BooleanQuery and inner
> MaxDisjunctionQuery clauses
> 
>     ( (title:albino^4 | description:albino)~0.1
>       (title:elephant^4 | description:elephant)~0.1
>     )
> 
> properly puts the albino elephants on top.
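
A hedged sketch of the max-disjunction idea in the quoted query. It
assumes that ~0.1 is a tie-breaker multiplier applied to the
non-maximal clause scores (the behavior of Lucene's later
DisjunctionMaxQuery), that each field match scores 1.0 before boosts,
and that the outer BooleanQuery uses coord = overlap/max. All names are
hypothetical.

```java
final class MaxDisjunctionSketch {
    // True if the whitespace-tokenized field text contains the term.
    static boolean hasTerm(String fieldText, String term) {
        return java.util.Arrays.asList(fieldText.split("\\s+")).contains(term);
    }

    // Max-disjunction score: the best clause, plus tieBreaker times the rest.
    // Assumes clause scores are non-negative.
    static double disMax(double tieBreaker, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    // (title:term^4 | description:term)~tieBreaker
    static double termGroup(String term, String title, String description,
                            double tieBreaker) {
        double t = hasTerm(title, term) ? 4.0 : 0.0;       // title boost of 4
        double d = hasTerm(description, term) ? 1.0 : 0.0;
        return disMax(tieBreaker, t, d);
    }

    // Outer BooleanQuery: sum of both term groups times coord.
    static double score(String title, String description, double tieBreaker) {
        double albino = termGroup("albino", title, description, tieBreaker);
        double elephant = termGroup("elephant", title, description, tieBreaker);
        int overlap = (albino > 0 ? 1 : 0) + (elephant > 0 ? 1 : 0);
        return (albino + elephant) * overlap / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(score("albino", "albino", 0.1));   // albino only
        System.out.println(score("albino", "elephant", 0.1)); // albino elephant
    }
}
```

With the max in place of the sum, a second field matching the same term
adds only the small tie-breaker share, so repeating albino across
fields cannot substitute for matching elephant.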

If "albino" is outscoring "elephant" then you could either reduce the 
impact of idf or increase the impact of coordination.  Did you try, 
e.g., defining coord as (overlap/max)^2 or somesuch?

Or, perhaps take proximity into account, with "albino elephant"~10?  Or
simply use AND instead of OR?  These days most web search engines use
AND as the default operator and reward for proximity.  Is that wrong for 
your application?  AND is effectively a coord of (overlap/max)^infinity.
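
The coord variants mentioned above can be sketched as a single function
with an exponent. This is an illustration only: the class and method
names are hypothetical, and the exponent parameter is not part of
Lucene's Similarity API.

```java
final class CoordExponentSketch {
    // Generalized coord: (overlap / maxOverlap) raised to a chosen exponent.
    // exponent = 1 is the plain ratio; exponent = 2 is the squared variant.
    static double coord(int overlap, int maxOverlap, double exponent) {
        return Math.pow((double) overlap / maxOverlap, exponent);
    }

    public static void main(String[] args) {
        // As the exponent grows, partial matches are driven toward 0
        // while full matches stay at 1, which approximates AND.
        for (double exponent : new double[] {1, 2, 10, 100}) {
            System.out.println("exponent=" + exponent
                + " partial=" + coord(1, 2, exponent)
                + " full=" + coord(2, 2, exponent));
        }
    }
}
```

A large finite exponent is used here rather than
Double.POSITIVE_INFINITY because Java's Math.pow(1.0, infinity) returns
NaN, so the full-match case would need special handling.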

Doug
