lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shai Erera (JIRA)" <>
Subject [jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
Date Fri, 24 Apr 2009 12:45:30 GMT


Shai Erera commented on LUCENE-1593:

Few updates (it's been long since I posted on this issue):
* I tried to move to pre-populate the queue in TFC, but that proved to be impossible (unless
I missed something). The only reason to do it was to get rid of the 'if (queueFull)' check
in every collect. However, it turned out that pre-populating the queue for TFC with sentinel
values is unreliable. Reason is, if I want to get rid of that 'if', I don't want to add any
'if' to, so the sentinel values should be something like MIN_VALUE
or MAX_VALUE (depends on the value of 'reverse'). However, someone can set a field value to
either of these values (and there are tests that do), so I need to check if FC if the current
value is a sentinel, which adds that 'if' back, only worse - it's now executed in every compare()
call. Unless I missed something, I don't think that's possible, or at least worth the effort
(to get rid of one 'if').
** BTW, in TSDC I use Float.NEG_INF as a sentinel value. This might be broken if a Scorer
decided to return that value, in which case pre-populating the queue will not work either.
I think it is still safe in TSDC, but want to get your opinion.

* I changed Sort.setSort() to not add SortField.FIELD_DOC, as suggested by Mike. But then
TestSort.testMultiSort failed. So I debugged it and either the test works fine but there's
a bug in MultiSearcher, or the test does not work fine (or should be adjusted) but then we'll
be changing runtime behavior of Sort (e.g., everyone who used setSort might get undesired
behavior, only in MultiSearcher).
** MultiSearcher's search(Weight, Filter, int, Sort) method executes a search against all
its Searchables, sequentially. The test adds documents to two indexes, odd and even, so that
the odd index is added two documents (A and E) with the value "5" in docs 0 and 2, and the
even index is added one doc (B) with value "5" in index 0. When I use the 2 SortField version
(2nd sorts by doc), the output is ABE, since the compare by doc uses the searcher's doc Ids
(0, 0, 2) and B is always less than E and so preferred, even though its 'fixed' doc Id is
7 (it appears in the 2nd Searcher). When I use the 1 SortField version, the result is AEB,
since B's fixed doc Id is 7, and now the code breaks a tie on their *true* doc Id.

I hope you were able to follow my description. That's why I don't know which one is the true
output ABE or AEB. I tend to say AEB, since those are the true doc Ids the application will
get in the end. Few more comments:
# TestSort.runMultiSort contains 3 versions of this test:
#* One that sorts by the INT value only, relying on the doc Id tie breaking. You can see the
output was not determined to be exact and so the pattern included is [ABE]{3}, as if the test's
output is not predicted. I think that's wrong in the first place, we should always have predicted
tests output, since we're not involving any randomness in that code.
#* The second explicitly sets INT followed by DOC Sorts, and expects [AB]{2}E.
#* The third relies on setSort's adding DOC, so it expects the same [AB]{2}E.
# The problem in MultiSearcher is that it uses FieldDocSortedHitQueue, but doesn't update
the DOC's field value the same as it does to the scoreDoc.doc value (adding the current Searcher's

Again, whatever the right output is, changing Sort to not include SortField.FIELD_DOC might
result in someone experiencing a change in behavior (if he used MultiSearcher), even if that
behavior is buggy.

What do you think?

> Optimizations to TopScoreDocCollector and TopFieldCollector
> -----------------------------------------------------------
>                 Key: LUCENE-1593
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Shai Erera
>             Fix For: 2.9
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove
unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segements in increasing doc Id order, instead of
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always
have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove
the check if reusableSD == null.
> # Also move to use "changing top" and then call adjustTop(), in case we update the queue.
> # some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" for the
last SortField. But, doing so should not be necessary (since we already break ties by docID),
and is in fact less efficient (once the above optimization is in).
> # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add
a addDummyObjects method which will populate the queue without "arranging" it, just store
the objects in the array (this can be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message