Return-Path: Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: (qmail 70266 invoked from network); 8 Aug 2010 20:21:48 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 8 Aug 2010 20:21:48 -0000 Received: (qmail 66269 invoked by uid 500); 8 Aug 2010 20:21:47 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 66227 invoked by uid 500); 8 Aug 2010 20:21:46 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 66220 invoked by uid 99); 8 Aug 2010 20:21:46 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 08 Aug 2010 20:21:46 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=RCVD_IN_DNSWL_NONE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy) Received: from [194.109.24.38] (HELO smtp-vbr18.xs4all.nl) (194.109.24.38) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 08 Aug 2010 20:21:38 +0000 Received: from k8u.localnet (porta.xs4all.nl [83.163.165.214]) by smtp-vbr18.xs4all.nl (8.13.8/8.13.8) with ESMTP id o78KLH6Y045009 for ; Sun, 8 Aug 2010 22:21:17 +0200 (CEST) (envelope-from paul.elschot@xs4all.nl) From: Paul Elschot To: dev@lucene.apache.org Subject: Re: passing required sub-scorers to BooleanScorer? Date: Sun, 8 Aug 2010 22:21:15 +0200 User-Agent: KMail/1.13.2 (Linux/2.6.32-24-generic; KDE/4.4.2; i686; ; ) References: <201008081713.32070.paul.elschot@xs4all.nl> In-Reply-To: MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201008082221.16061.paul.elschot@xs4all.nl> X-Virus-Scanned: by XS4ALL Virus Scanner X-Virus-Checked: Checked by ClamAV on apache.org Op zondag 08 augustus 2010 18:35:47 schreef Michael McCandless: > On Sun, Aug 8, 2010 at 11:13 AM, Paul Elschot wrote: > > Op zondag 08 augustus 2010 16:04:54 schreef Michael McCandless: > >> I noticed that the SubScorer in BooleanScorer is able to handle > >> "required" clauses, and spends some CPU confirming each hit matches > >> the required clauses. > >> > >> Yet, BooleanQuery will never do so (it always uses BooleanScorer2 if > >> there are any required clauses). > >> > >> And, if I assert !required in BooleanScorer, all tests pass... so it > >> really looks to be unused code. > >> > >> Does anyone know the history here? > > > > BooleanScorer2 was introduced to use advance() (former skipTo()) > > when not all subscorers are required. > > Iirc when skipTo() was introduced it was initially only used by > > ConjunctionScorer (all required, AND type query) and PhraseScorer. > > BooleanScorer works nicely when some sub-scorers are required > > but it neither uses nor provides advance(), so it should always be > > used with some care. And when no particular sub-scorer is required > > (OR type query) BooleanScorer is the fastest one around, but it can > > score docs out of order. > > OK, but it looks like right now we never pass required clauses to BooleanScorer. Indeed. > > >> Did we used to have BooleanScorer > >> handle certain BQ's with required clauses? > > > > Before skipTo() all such BQ's were handled by BooleanScorer. > > OK. > > >> (It seems likely it could > >> give better performance in many cases, eg when the freq of the 2 > >> sub-queries are comparable). > > > > Do you mean when the 2 sub-queries have many docs in common? > > In that case an AND is almost equivalent to an OR, so > > BooleanScorer could indeed be faster than BooleanScorer2. > > I meant when the two clauses have similar freqs ("typically" this will > mean they have many docs in common, but not necessarily). > > .advance has fairly high cost, so, it really should only be used when > it's expected to save a good number of .next's. This is tricky business. For example consider the case where there are a few large gaps for one scorer in which the other scorer has lots of docs, and vice versa. For text fields one could gamble more or less reliably that that won't happen, but for numerical fields... > My guess is we should re-activate BS for scoring required clauses, in > certain cases. It will likely be much faster... the problem is it's > hard to figure out what those cases are because our Scorers can't > estimate their rough hit counts. Maybe we should add an > "estimatedCount" or something to Scorer... This has come up before, for example such an estimation is also nice to have in ConjunctionScorer when there are more than 2 sub-scorers to allow the most sparse ones to be iterated first. The current approximation for that seems to work good enough in practice, but use of this estimation would be a more reliable way. > but maybe until then, we > should comment out the BS code that tries to handle required clauses. Or at least add a comment that this code is currently unused. > BS2 will be faster in highly lopsided cases (AND of a rare term w/ a > common one). Iirc currently the default is to not use unordered scoring at all, and there have been no more than a few complaints about slow performance for which unordered scoring was faster. Regards, Paul Elschot > > Mike > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org > For additional commands, e-mail: dev-help@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional commands, e-mail: dev-help@lucene.apache.org