From: Michael Sokolov
Date: Sun, 3 Feb 2019 10:41:01 -0500
Subject: Re: prorated early termination
To: java-user@lucene.apache.org

> > In single-threaded mode
> > we can check against minCompetitiveScore and terminate collection for
> > each segment appropriately,
>
> Does Lucene do this today by default? That should be a nice optimization,
> and it'd be safe/correct.

Yes, it does that today (in TopFieldCollector -- see
https://github.com/msokolov/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java#L225
)

Re: our high cost of collection in the static ranking phase -- that is
true, Mike, but I do also see a nice improvement on the luceneutil
benchmark (modified to have a sorted index and collect concurrently) using
just a vanilla TopFieldCollector. I looked at some profiler output, and it
just seems to be showing more time spent walking postings.

-Mike Sokolov

On Sun, Feb 3, 2019 at 10:11 AM Michael McCandless <lucene@mikemccandless.com> wrote:

> I think this is because our per-hit cost is sometimes very high -- we
> have "post filters" that are sometimes very restrictive. We are working
> to get those post-filters out into an inverted index to make them more
> efficient, but net/net reducing how many hits we must collect for each
> segment can help latencies and throughput.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Fri, Feb 1, 2019 at 11:42 AM Michael Sokolov wrote:
>
> > That's a good question. I don't have a very satisfying answer other
> > than to say we saw some improvements, and I would have to dig more now
> > to say why. It may be that in our system we have some additional
> > per-document costs in a custom collector that were saved by this
> > reduction, and that's why we saw a latency reduction.
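[Archive editor's note: the single-threaded check described above -- stop collecting a segment once the queue is full and a hit can no longer be competitive -- can be sketched roughly as below. This is a simplified illustration with invented names (EarlyTerminationSketch), not the actual TopFieldCollector code; see the GitHub link above for the real implementation.]

```java
import java.util.PriorityQueue;

// Simplified per-segment collector for an index whose sort matches the
// query sort (ascending). Once the queue holds topN hits, the first
// non-competitive doc ends collection for the whole segment, because
// docs arrive in sort order and every later doc sorts no better.
public class EarlyTerminationSketch {
    static class CollectionTerminatedException extends RuntimeException {}

    private final int topN;
    // Max-heap on the sort key, so peek() is the current worst competitive hit.
    private final PriorityQueue<Long> queue =
        new PriorityQueue<>((a, b) -> Long.compare(b, a));

    EarlyTerminationSketch(int topN) {
        this.topN = topN;
    }

    /** Collect one doc's sort value; throws to stop collecting this segment. */
    void collect(long sortValue) {
        if (queue.size() < topN) {
            queue.add(sortValue);
        } else if (sortValue < queue.peek()) {
            // Competitive: beats the current worst collected hit.
            queue.poll();
            queue.add(sortValue);
        } else {
            // Not competitive, and docs are visited in index-sort order, so
            // no later doc in this segment can be competitive either.
            throw new CollectionTerminatedException();
        }
    }
}
```

With topN=2, collecting sort values 1 and 2 fills the queue, and the next (larger) value terminates the segment instead of being collected.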
> > However I also note these luceneutil benchmark results:
> >
> > ** This is with -topN=500, and a searcher threadpool=8
> > Report after iter 19:
> >                  Task    QPS baseline     QPS candidate    Pct diff
> > HighTermDayOfYearSort    391.23 (2.4%)    627.92 (4.8%)    60.5% (52% - 69%)
> >
> > ** This is with -topN=500, and no searcher threadpool
> > Report after iter 19:
> >                  Task    QPS baseline     QPS candidate    Pct diff
> > HighTermDayOfYearSort    239.52 (3.3%)    232.70 (3.0%)    -2.8% (-8% - 3%)
> >
> > which show QPS improvement from using threads even in the baseline case
> > (and then an additional improvement from prorating).
> >
> > On Fri, Feb 1, 2019 at 11:28 AM Adrien Grand wrote:
> >
> > > Something makes me curious: queries that can leverage sorted indices
> > > should be _very_ fast, for instance in your case they only need to
> > > look at 500 documents per segment at most (less in practice since we
> > > stop collecting as soon as a non-competitive hit is found), so why do
> > > you need to parallelize query execution?
> > >
> > > On Fri, Feb 1, 2019 at 3:18 PM Michael Sokolov wrote:
> > >
> > > > I want to propose an optimization to early termination that gives
> > > > nice speedups for large result sets when searching with multiple
> > > > threads, at the cost of a small (controllable) probability of
> > > > collecting documents out of order: in benchmarks I see +60-70% QPS
> > > > for tasks like HighTermDayOfYearSort when topN=500, using 8 threads
> > > > to search, and in our production system this optimization cut the
> > > > latency of our slowest queries substantially.
> > > >
> > > > In a multi-phase ranking scenario a typical pattern is to retrieve
> > > > a largish number of matches in a first pass using the indexed sort,
> > > > followed by a second pass that re-ranks and selects a smaller top
> > > > K, using a more expensive ranking.
> > > > N is chosen to provide sufficient probability of finding the
> > > > desired top K across the whole index, given that the index sort is
> > > > some approximation to the desired sort. When ranking by the indexed
> > > > sort, as in TopFieldCollector, we can now early-terminate once a
> > > > sufficient number of matches has been found, so that we only need
> > > > to retrieve N documents from each segment. In single-threaded mode
> > > > we can check against minCompetitiveScore and terminate collection
> > > > for each segment appropriately, but when using multiple threads to
> > > > search concurrently there is no such coordination, and we end up
> > > > collecting N documents *per segment*, which are then merged down to
> > > > N.
> > > >
> > > > We do not need to collect so many documents, though. For any given
> > > > segment, let p = (leaf.maxDoc / topLevel.maxDoc) be the proportion
> > > > of documents in that segment. Assuming that documents are
> > > > distributed randomly among segments, we can expect that on average
> > > > we will find p*N of the top N documents in the given segment. If we
> > > > only collect p*N documents, we will sometimes miss documents that
> > > > we should have collected, collecting some less-competitive
> > > > documents from one segment while not collecting all the competitive
> > > > documents from another. But how many should we collect in order to
> > > > make this occur only very rarely?
> > > >
> > > > The worst case is that all top N documents occur in a single
> > > > segment. For even small values of N and small numbers of segments
> > > > S, this probability is vanishingly small: (N=10, S=10) ->
> > > > 10^(1-N) = 1/10^9. More generally, this distribution of documents
> > > > among segments is a multinomial distribution, and the variance of
> > > > the number of documents in a single segment is that of a binomial
> > > > distribution.
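[Archive editor's note: the per-leaf budget this works out to -- collect p*N + 3*sqrt(p*(1-p)*N) documents from a segment holding proportion p of the index -- can be sketched as below. The method name echoes the leafHitsThreshold parameter proposed below in the thread, but this code is only an illustration, not Lucene's API.]

```java
// Prorated per-leaf collection budget: the expected share of the top N
// hits in this segment, plus three binomial standard deviations as a
// safety margin against ranking errors.
public class ProratedBudget {
    /**
     * @param topN            number of top hits requested over the whole index
     * @param leafMaxDoc      document count of this segment
     * @param topLevelMaxDoc  document count of the whole index
     * @return how many hits to collect from this segment
     */
    static int leafHitsThreshold(int topN, int leafMaxDoc, int topLevelMaxDoc) {
        double p = (double) leafMaxDoc / topLevelMaxDoc;
        double mean = p * topN;                        // expected share of the top N
        double sigma = Math.sqrt(p * (1 - p) * topN);  // binomial std deviation
        // Never collect more than topN; 3 sigma bounds the chance of missing
        // a competitive hit to a fraction of a percent per segment.
        return (int) Math.min(topN, Math.ceil(mean + 3 * sigma));
    }

    public static void main(String[] args) {
        // Thread's example: N=500, p=0.2 -> 100 + 3*sqrt(80) ~ 127, not 500.
        System.out.println(leafHitsThreshold(500, 200_000, 1_000_000)); // prints 127
    }
}
```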
> > > > The binomial variance in this case (p = probability of a document
> > > > being in the segment, N = number of documents) is p*(1-p)*N; we
> > > > can use this to compute the number of documents to collect per
> > > > leaf in order to bound the probability of a ranking error. I'm
> > > > using a cutoff of 3 standard deviations, i.e. collecting
> > > > p*N + 3*(p*(1-p)*N)^(1/2) documents for each segment. For N=500,
> > > > p=0.2, we can collect about 127 documents instead of 500, at the
> > > > cost of an error that occurs < 3/1000 of the time.
> > > >
> > > > Also note that the kind of errors we make are typically benign. In
> > > > most cases we will return the correct top N-1 documents, but
> > > > instead of returning the Nth-ranked document in position N, we
> > > > return the N+1st.
> > > >
> > > > Implementing this in Lucene requires a small patch to
> > > > TopFieldCollector to introduce a leafHitsThreshold comparable to
> > > > the existing totalHitsThreshold. Given the possibility of error,
> > > > it might be good to have a way to disable this, but my inclination
> > > > would be to enable it whenever approximate counts are enabled
> > > > (i.e. by default), and disable it when totalHitsThreshold is
> > > > MAX_VALUE.
> > > >
> > > > What do you think? Shall I open an issue?
> > >
> > > --
> > > Adrien
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org