lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Reducing number of poor results from large BooleanQueries
Date Fri, 09 Sep 2005 22:51:17 GMT

: Here is an approach which works based on the quantity
: of matching terms in an adapted BooleanQuery:
:
: http://issues.apache.org/bugzilla/show_bug.cgi?id=35284

Doh! ... I should really start paying attention to the stuff in SVN, I
didn't even know there was a DisjunctionSumScorer -- this is exactly what
I had in mind when I first started thinking about "Alternative #2".

But...

: This approach of course is based purely on the
: quantity of matching terms, not the quality-based

...this is what I'm worried about.

: measures in your example. As you suggest, quality is a
: combination of user-derived measures (boosts) and
: data-derived measures (tf,idf, docBoost). It sounds
: like a more informed  approach in principle but I'm
: not currently sure how it would be implemented
: efficiently in practice. Here's one possible approach

That's the thing: I'm thinking that if there were a subclass of
DisjunctionSumScorer (say "DisjunctionBoostSumScorer") that totaled the
boosts of all the sub-queries, and compared the sum of the boosts of the
sub-queries that match each doc against a percentage of that total, that
would be a very simple, inexpensive calculation that would at least
allow us to leverage the user-derived measures of the score -- if not the
data-derived measures.

Does that make sense?  Does it seem like taking advantage of the Boosts
instead of just the coord would be worthwhile?
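
To make the boost-sum idea concrete, here is a minimal sketch of just the
acceptance test, written as plain Java with no Lucene dependency. The class
BoostSumMatcher, its method names, and the example boosts are all
hypothetical, made up for illustration; this is not the
DisjunctionSumScorer code from SVN, only the arithmetic a subclass along
those lines might perform per document.

```java
/**
 * Hypothetical helper (not a Lucene class): accepts a document when the
 * boosts of the sub-queries that matched it add up to at least a given
 * fraction of the total boost of all sub-queries.
 */
final class BoostSumMatcher {
    private final float[] boosts;     // one boost per optional sub-query
    private final float totalBoost;   // sum of all sub-query boosts
    private final float minFraction;  // required share of totalBoost, 0..1

    BoostSumMatcher(float[] boosts, float minFraction) {
        if (minFraction < 0f || minFraction > 1f) {
            throw new IllegalArgumentException("minFraction must be in [0, 1]");
        }
        this.boosts = boosts.clone();
        float total = 0f;
        for (float b : boosts) {
            total += b;
        }
        this.totalBoost = total;
        this.minFraction = minFraction;
    }

    /** Sum of the boosts of the sub-queries flagged as matching. */
    float matchedBoost(boolean[] matched) {
        if (matched.length != boosts.length) {
            throw new IllegalArgumentException("one flag per sub-query expected");
        }
        float sum = 0f;
        for (int i = 0; i < boosts.length; i++) {
            if (matched[i]) {
                sum += boosts[i];
            }
        }
        return sum;
    }

    /** True when the matched boosts reach the required share of the total. */
    boolean accepts(boolean[] matched) {
        return matchedBoost(matched) >= minFraction * totalBoost;
    }

    public static void main(String[] args) {
        // Assumed query: canon^1 eos^3 5d^2, requiring 50% of the total boost.
        BoostSumMatcher m = new BoostSumMatcher(new float[] {1f, 3f, 2f}, 0.5f);
        System.out.println("canon only: " + m.accepts(new boolean[] {true, false, false}));
        System.out.println("eos only:   " + m.accepts(new boolean[] {false, true, false}));
        System.out.println("canon + 5d: " + m.accepts(new boolean[] {true, false, true}));
    }
}
```

With those assumed boosts, a doc that matches only the heavily boosted
clause passes the 50% test even though it matches just one of three
clauses, which is something a purely count-based cutoff cannot express.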

: I have previously optimized large BooleanQueries
: generated by nGrams before now by taking only the top
: idf-ranked terms - purely to reduce query times. A
: similar approach could be used to automatically
: rewrite a BooleanQuery consisting of entirely optional
: terms into the equivalent of:
: +( my high idf terms) (low idf terms)

Alas, i don't know if that is a practical solution for my situation:

1) There is no guarantee that all possible sub-Queries can be deconstructed
into Terms, so you can't rank exclusively by idf (consider, for example, the
Queries Yonik submitted in bug#35796).

2) Even if we confine ourselves to simple queries consisting purely of Term
queries, your suggested approach may overemphasize Terms that aren't
particularly important to the user -- or worse, terms that the user
misspelled or misremembered.

Imagine a user is trying to search for the digital camera "Canon EOS 5D",
but when they saw the name of the camera in a magazine, they didn't
realize that the "EOS" is "ee oh es"; they thought it was "ee zero es", so
they search for "Canon E0S 5D".

"E0S" may not even be in the index, giving it a really high idf -- which,
based on your suggestion, would make it a mandatory term, so the results
would be empty.  Even if we had an explicit check to ignore terms with a
docFreq of 0, there might be one product that actually contained "E0S" in
its name, giving the user results that contain only that product --
ignoring all Canon products or products with "5D" in their names.
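
The failure mode above can be sketched with a few lines of plain Java. The
class IdfRewriteSketch, the index size, and the docFreq numbers are all
invented for illustration, and the idf formula (ln(numDocs / (docFreq + 1))
+ 1) is assumed to be the classic default-similarity one; treat it as a
rough model of the "top idf-ranked terms become mandatory" rewrite, not as
anyone's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of picking the top idf-ranked terms as mandatory. */
final class IdfRewriteSketch {

    /** Assumed classic idf: ln(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /**
     * Returns the n terms with the highest idf, i.e. the ones the rewrite
     * would place in the mandatory group. Terms with docFreq 0 are dropped
     * first when skipZeroDocFreq is set.
     */
    static List<String> topIdfTerms(Map<String, Integer> docFreqs, int numDocs,
                                    int n, boolean skipZeroDocFreq) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (skipZeroDocFreq && e.getValue() == 0) {
                continue;
            }
            terms.add(e.getKey());
        }
        // Highest idf (rarest term) first.
        terms.sort((a, b) -> Double.compare(
                idf(docFreqs.get(b), numDocs), idf(docFreqs.get(a), numDocs)));
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    public static void main(String[] args) {
        int numDocs = 10000; // made-up index size
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("canon", 500); // made-up docFreq values
        docFreqs.put("e0s", 1);     // the one product that really contains the typo
        docFreqs.put("5d", 40);
        // Even with the docFreq == 0 check enabled, the typo is picked as mandatory.
        System.out.println("mandatory: " + topIdfTerms(docFreqs, numDocs, 1, true));
    }
}
```

Under these assumed numbers the misspelled term has the highest idf, so the
rewritten query would require it and return only the single product that
happens to contain it.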


-Hoss



