lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: searchAfter is missing results when custom noncontinuous slices are used
Date Thu, 25 May 2017 14:11:53 GMT
Yes, there is a (hidden) assumption in TopDocs.merge that the hits it's
merging are logically non-overlapping, sequential slices of the index, but
in your case they are "interleaved".
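
As a rough illustration of what "sequential" means here, the sketch below
splits leaves into contiguous runs in index order while still aiming for
similar document counts. It is a model only: leaves are represented by an
int array of per-leaf document counts rather than LeafReaderContext, and
the class and method names (ContiguousSlices, contiguousSlices) are
hypothetical, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

public class ContiguousSlices {

    /**
     * Groups leaf ordinals 0..n-1 into at most maxSlices contiguous runs.
     * leafDocCounts[i] is the number of documents in leaf i, in index order.
     * Hypothetical helper for illustration; not a Lucene API.
     */
    public static List<List<Integer>> contiguousSlices(int[] leafDocCounts, int maxSlices) {
        if (maxSlices < 1) {
            throw new IllegalArgumentException("maxSlices must be >= 1");
        }
        long total = 0;
        for (int count : leafDocCounts) {
            total += count;
        }
        List<List<Integer>> slices = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long docsSoFar = 0; // documents in all leaves consumed so far
        for (int ord = 0; ord < leafDocCounts.length; ord++) {
            current.add(ord);
            docsSoFar += leafDocCounts[ord];
            // The last allowed slice stays open and absorbs all remaining leaves.
            boolean lastSliceOpen = slices.size() == maxSlices - 1;
            // Close the slice once the running total reaches the next equal-share boundary.
            long boundary = total * (slices.size() + 1) / maxSlices;
            if (!lastSliceOpen && docsSoFar >= boundary) {
                slices.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            slices.add(current);
        }
        return slices;
    }

    public static void main(String[] args) {
        // Prints [[0], [1, 2, 3, 4]]
        System.out.println(contiguousSlices(new int[] {50, 10, 10, 10, 20}, 2));
    }
}
```

Because every slice covers a contiguous range of leaves, a lower slice
index always implies lower docIDs, so breaking ties by slice index and
breaking them by docID should agree. The trade-off is that the slices may
be less evenly balanced than with the size-sorted approach.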

TopDocs.merge otherwise doesn't trust the incoming docIDs to be from the
same docID space, although in your case they are.

Maybe we could improve TopDocs.merge to optionally use the already global
docID for tie breaking?
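
To make the difference concrete, here is a small self-contained model of
the two tie-breaking rules. It does not use Lucene at all: Hit,
TIES_BY_SHARD, TIES_BY_DOC and pageThrough are hypothetical names, and the
per-slice collection plus merge is simplified to "filter everything after
the last hit, sort in merge order, take one page". All four hits have the
same score, and the two slices are interleaved (slice 0 holds docs 0 and
2, slice 1 holds docs 1 and 3).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TieBreakSketch {

    /** Hypothetical stand-in for a ScoreDoc: score, global docID, producing slice. */
    public static final class Hit {
        final float score;
        final int doc;
        final int shardIndex;

        public Hit(float score, int doc, int shardIndex) {
            this.score = score;
            this.doc = doc;
            this.shardIndex = shardIndex;
        }
    }

    /** Merge order as described in this thread: score descending, ties broken by shard index. */
    public static final Comparator<Hit> TIES_BY_SHARD =
        Comparator.comparingDouble((Hit h) -> -h.score)
            .thenComparingInt(h -> h.shardIndex)
            .thenComparingInt(h -> h.doc);

    /** Alternative merge order: score descending, ties broken by global docID. */
    public static final Comparator<Hit> TIES_BY_DOC =
        Comparator.comparingDouble((Hit h) -> -h.score)
            .thenComparingInt(h -> h.doc);

    /** searchAfter-style filter: lower score, or equal score and a higher docID. */
    static boolean isAfter(Hit h, Hit after) {
        if (after == null) return true;
        if (h.score != after.score) return h.score < after.score;
        return h.doc > after.doc;
    }

    /** Pages through all hits and returns the docIDs that were actually delivered. */
    public static List<Integer> pageThrough(List<Hit> hits, int pageSize, Comparator<Hit> mergeOrder) {
        List<Integer> seen = new ArrayList<>();
        Hit after = null;
        while (true) {
            List<Hit> page = new ArrayList<>();
            for (Hit h : hits) {
                if (isAfter(h, after)) page.add(h);
            }
            if (page.isEmpty()) return seen;
            page.sort(mergeOrder);
            if (page.size() > pageSize) page = page.subList(0, pageSize);
            for (Hit h : page) seen.add(h.doc);
            after = page.get(page.size() - 1); // last hit of this page becomes "after"
        }
    }

    /** Two interleaved slices, all hits with equal score. */
    public static List<Hit> interleavedExample() {
        return Arrays.asList(
            new Hit(1.0f, 0, 0), new Hit(1.0f, 2, 0),
            new Hit(1.0f, 1, 1), new Hit(1.0f, 3, 1));
    }

    public static void main(String[] args) {
        List<Hit> hits = interleavedExample();
        System.out.println(pageThrough(hits, 2, TIES_BY_SHARD)); // [0, 2, 3]: doc 1 is lost
        System.out.println(pageThrough(hits, 2, TIES_BY_DOC));   // [0, 1, 2, 3]
    }
}
```

With shard-index tie breaking the first page ends at doc 2, so the next
call filters out doc 1 even though it was never returned. With docID tie
breaking the merge order matches the searchAfter filter and nothing is
skipped.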

Yes, please open an issue.  Maybe we just improve the javadocs as you
suggested, but the situation sure is trappy today.

Thanks,

Mike McCandless

http://blog.mikemccandless.com

On Wed, May 24, 2017 at 10:06 AM, Christoph Kaser <lucene_list@iconparc.de>
wrote:

> Hello everybody,
>
> I have observed an unexpected behavior in Lucene, and I am unsure whether
> this is a bug, or a missing warning in the documentation:
>
> I am using the IndexSearcher with an ExecutorService in order to take
> advantage of multiple CPU cores during the searches. I want to limit the
> number of cores a single search can occupy, so I have overridden the
> IndexSearcher method
>     protected LeafSlice[] slices(List<LeafReaderContext> leaves)
> to return a fixed number of slices (e.g. 4).
>
> I tried to create slices that are about the same size by looping over the
> leaves (ordered by size descending) and adding the current leaf to the
> slice with the smallest number of documents.
>
> This worked well, until I stumbled upon a query for which searchAfter
> seemed to skip hits, so that the total number of hits obtained by multiple
> calls to searchAfter was lower than TopDocs.totalHits.
>
> The issue seems to be how searchAfter works vs how TopDocs.merge works:
>
> searchAfter skips every document with a higher score than the "after"
> document. In case of equal scores, it uses the document id and skips every
> document with a <= document id (see PagingFieldCollector).
>
> TopDocs.merge uses the score to determine which hits should be part of the
> merged TopDocs. In case of equal scores, it uses the shard index (this
> corresponds to the slices the IndexSearcher uses) to break ties (see
> ScoreMergeSortQueue.lessThan).
>
> So if the shards are noncontinuous (as they are in my case), searchAfter
> uses a different way of sorting the documents than TopDocs.merge, and
> therefore hits are skipped.
>
> Here are my questions:
>
> * Are slices meant to be continuous "sublists" of the passed leaves-list?
> Or is my way of slicing meant to be supported?
> * If my way of slicing is not supported, could you either add a warning to
> the javadocs of the slices method or maybe even add a check for a legal
> return value of slices()?
> * Should I create a jira issue for this?
>
> Sorry for the wall of text, I hope I explained the problem in an
> understandable way!
>
> Thank you and best regards
> Christoph
>
>
>
