lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
Date Sat, 20 Feb 2010 18:04:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836214#action_12836214
] 

Michael McCandless commented on LUCENE-2271:
--------------------------------------------

I would rather not change the core default collector here, when
sorting by score.

All of the patches being considered would add an extra comparison/and
(or maybe if) per collect call, which while presumably small, is still
an added cost.

Likely this added cost is in the noise of a benchmark so we wouldn't
be able to conclude much by benching... but enough of these added
costs will eventually add up.  At the end of the day you are adding
non-zero work for the CPU to do, for every collect.

This is Lucene's searching hotspot, so we should only be adding extra
code on this critical path when it's really needed.

It's ashame to make everyone pay that price when in practice very few
users need to collect -Inf/NaN scoring docs.  This hasn't been hit in
a "real" use case yet.

In fact, before 2.9 (ie up to and including 2.4), we by default
filtered out all docs scoring <= 0.0, so -Inf and NaN were always
filtered out, anyway.

Users who somehow do hit this in a real use case have a good
workaround: use TopFieldCollector, sorting by Relevance.  This will
properly handle -Inf, and will at least collect NaN scoring docs (but
their sort order will be random, as it always has).  Or, use
PositiveScoresOnlyCollector (to get back to Lucene 2.4 behavior).  Or,
create a custom collector.


> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results
with TopScoreDocCollector
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2271
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2271
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Priority: Minor
>             Fix For: 2.9.2, 3.0.1, 3.1
>
>         Attachments: LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch,
LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
>
>
> This is a foolowup to LUCENE-2270, where a part of this problem was fixed (boost = 0
leading to NaN scores, which is also un-intuitive), but in general, function queries in Solr
can create these invalid scores easily. In previous version of Lucene these scores ordered
correct (except NaN, which mixes up results), but never invalid document ids are returned
(like Integer.MAX_VALUE).
> The problem is: TopScoreDocCollector pre-fills the HitQueue with sentinel ScoreDocs with
a score of -inf and a doc id of Integer.MAX_VALUE. For the HQ to work, this sentinel must
be smaller than all posible values, which is not the case:
> - -inf is equal and the document is not inserted into the HQ, as not competitive, but
the HQ is not yet full, so the sentinel values keep in the HQ and result is the Integer.MAX_VALUE
docs. This problem is solveable (and only affects the Ordered collector) by chaning the exit
condition to:
> {code}
> if (score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE) {
>     // Since docs are returned in-order (i.e., increasing doc Id), a document
>     // with equal score to pqTop.score cannot compete since HitQueue favors
>     // documents with lower doc Ids. Therefore reject those docs too.
>     return;
> }
> {code}
> - The NaN case can be fixed in the same way, but then has another problem: all comparisons
with NaN result in false (none of these is true): x < NaN, x > NaN, NaN == NaN. This
leads to the fact that HQ's lessThan always returns false, leading to unexspected ordering
in the PQ and sometimes the sentinel values do not stay at the top of the queue. A later hit
then overrides the top of the queue but leaves the incorrect sentinels  unchanged -> invalid
results. This can be fixed in two ways in HQ:
> Force all sentinels to the top:
> {code}
> protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
>     if (hitA.doc == Integer.MAX_VALUE)
>       return true;
>     if (hitB.doc == Integer.MAX_VALUE)
>       return false;
>     if (hitA.score == hitB.score)
>       return hitA.doc > hitB.doc; 
>     else
>       return hitA.score < hitB.score;
> }
> {code}
> or alternatively have a defined order for NaN (Float.compare sorts them after +inf):
> {code}
> protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
>     if (hitA.score == hitB.score)
>       return hitA.doc > hitB.doc; 
>     else
>       return Float.compare(hitA.score, hitB.score) < 0;
> }
> {code}
> The problem with both solutions is, that we have now more comparisons per hit and the
use of sentinels is questionable. I would like to remove the sentinels and use the old pre
2.9 code for comparing and using PQ.add() when a competitive hit arrives. The order of NaN
would be unspecified.
> To fix the order of NaN, it would be better to replace all score comparisons by Float.compare()
[also in FieldComparator].
> I would like to delay 2.9.2 and 3.0.1 until this problem is discussed and solved.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message