Something makes me curious: queries that can leverage sorted indices
should be _very_ fast, for instance in your case they only need to
look at 500 documents per segment at most (less in practice since we
stop collecting as soon as a noncompetitive hit is found), so why do
you need to parallelize query execution?
On Fri, Feb 1, 2019 at 3:18 PM Michael Sokolov <msokolov@gmail.com> wrote:
>
> I want to propose an optimization to early termination that gives nice
> speedups for large result sets when searching with multiple threads at the
> cost of a small (controllable) probability of collecting documents out of
> order: in benchmarks I see +6070% QPS for tasks like HighTermDayOfYearSort
> when topN=500, using 8 threads to search, and in our production system this
> optimization cut the latency of our slowest queries substantially.
>
> In a multiphase ranking scenario a typical pattern is to retrieve a
> largish number of matches in a first pass using indexed sort, followed by a
> second pass that reranks and selects a smaller top K, using a more
> expensive ranking. N is chosen to provide sufficient probabililty of
> finding the desired top K across the whole index, given that the index sort
> is some approximation to the desired sort. When ranking by indexed sort, as
> in TopFieldCollector, we can now earlyterminate when a sufficient number
> of matches have been found so that we only need retrieve N documents from
> each segment. In singlethreaded mode we can check against
> minCompetitiveScore and terminate collection for each segment
> appropriately, but when using multiple threads to search concurrently there
> is no such coordination and we end up collecting N documents *per segment*,
> which are then merged down to N.
>
> We do not need to collect so many documents though. For any given segment,
> let p=(leaf.maxDoc/topLevel.maxDoc) be the proportion of documents in that
> segment. Assuming that documents are distributed randomly among segments,
> we can expect that on average we will find p*N of the top N documents in
> the given segment. If we only collect p*N documents, we will sometimes miss
> some documents that we should have collected, collecting some
> lesscompetitive documents from one segment, while not collecting all the
> competitive documents from another. But how many should we collect in order
> to make this occur only very rarely?
>
> The worst case is that all top N documents occur in a single segment. For
> even small values of N and small numbers of segments S, this probability is
> vanishingly small (N=10, S=10) > 10^(1N) = 1/10^9. More generally, this
> distribution of documents among segments is a multinomial distribution, and
> the variance of the number of documents in a single segment is that of a
> binomial distribution. The binomial variance in this case (p=probability of
> document in the segment, N number of documents) is p*(1p)*N; we can use
> this to compute the number of documents to collect per leaf in order to
> bound the probability of a ranking error. I'm using a cutoff of 3 standard
> deviations, i.e. collecting p*N + 3*(p*(1p)*N)^1/2 documents for each
> segment. For N=500, p=0.2, we can collect 67 documents instead of 500 at
> the cost of an error that occurs < 3/1000.
>
> Also note that the kind of errors we make are typically benign. In most
> cases we will return the correct top N1 documents, but instead of
> returning the Nthranked document in position N, we return the N+1st.
>
> Implementing this in Lucene requires a small patch to TopFieldCollector to
> introduce a leafHitsThreshold comparable to the existing
> totalHitsThreshold. Given the possibility of error, it might be good to
> have a way to disable this, but my inclination would be to enable it
> whenever approximate counts are enabled (ie by default), and disable when
> totalHitsThreshold is MAX_VALUE.
>
> What do you think? Shall I open an issue?

Adrien

To unsubscribe, email: javauserunsubscribe@lucene.apache.org
For additional commands, email: javauserhelp@lucene.apache.org
