Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Received-SPF: pass (nike.apache.org: local policy)
From: Paul Elschot <paul.elschot@xs4all.nl>
To: dev@lucene.apache.org
Subject: Re: passing required sub-scorers to BooleanScorer?
Date: Sun, 8 Aug 2010 22:21:15 +0200
User-Agent: KMail/1.13.2 (Linux/2.6.32-24-generic; KDE/4.4.2; i686; ; )
References: <AANLkTimAoTT=qmhKepGQCbf_=06vi3L-x=M-9Ok9yy6_@mail.gmail.com>
 <201008081713.32070.paul.elschot@xs4all.nl>
 <AANLkTineT08ztTNfV17L1c6=28mCvP0w_AKPGU=LoXuE@mail.gmail.com>
In-Reply-To: <AANLkTineT08ztTNfV17L1c6=28mCvP0w_AKPGU=LoXuE@mail.gmail.com>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201008082221.16061.paul.elschot@xs4all.nl>

Op zondag 08 augustus 2010 18:35:47 schreef Michael McCandless:
> On Sun, Aug 8, 2010 at 11:13 AM, Paul Elschot <paul.elschot@xs4all.nl> wrote:
> > Op zondag 08 augustus 2010 16:04:54 schreef Michael McCandless:
> >> I noticed that the SubScorer in BooleanScorer is able to handle
> >> "required" clauses, and spends some CPU confirming each hit matches
> >> the required clauses.
> >>
> >> Yet, BooleanQuery will never do so (it always uses BooleanScorer2 if
> >> there are any required clauses).
> >>
> >> And, if I assert !required in BooleanScorer, all tests pass... so it
> >> really looks to be unused code.
> >>
> >> Does anyone know the history here?
> >
> > BooleanScorer2 was introduced to use advance() (former skipTo())
> > when not all subscorers are required.
> > Iirc when skipTo() was introduced it was initially only used by
> > ConjunctionScorer (all required, AND type query) and PhraseScorer.
> > BooleanScorer works nicely when some sub-scorers are required
> > but it neither uses nor provides advance(), so it should always be
> > used with some care. And when no particular sub-scorer is required
> > (OR type query) BooleanScorer is the fastest one around, but it can
> > score docs out of order.
> 
> OK, but it looks like right now we never pass required clauses to BooleanScorer.

Indeed.

> 
> >> Did we used to have BooleanScorer
> >> handle certain BQ's with required clauses?
> >
> > Before skipTo() all such BQ's were handled by BooleanScorer.
> 
> OK.
> 
> >> (It seems likely it could
> >> give better performance in many cases, eg when the freq of the 2
> >> sub-queries are comparable).
> >
> > Do you mean when the 2 sub-queries have many docs in common?
> > In that case an AND is almost equivalent to an OR, so
> > BooleanScorer could indeed be faster than BooleanScorer2.
> 
> I meant when the two clauses have similar freqs ("typically" this will
> mean they have many docs in common, but not necessarily).
> 
> .advance has fairly high cost, so, it really should only be used when
> it's expected to save a good number of .next's.

This is tricky business. For example consider the case where there
are a few large gaps for one scorer in which the other scorer has lots
of docs, and vice versa. For text fields one could gamble more or less
reliably that that won't happen, but for numerical fields...

> My guess is we should re-activate BS for scoring required clauses, in
> certain cases.  It will likely be much faster... the problem is it's
> hard to figure out what those cases are because our Scorers can't
> estimate their rough hit counts.  Maybe we should add an
> "estimatedCount" or something to Scorer...

This has come up before, for example such an estimation is also nice to
have in ConjunctionScorer when there are more than 2 sub-scorers
to allow the most sparse ones to be iterated first. The current approximation
for that seems to work good enough in practice, but use of this estimation
would be a more reliable way.

> but maybe until then, we
> should comment out the BS code that tries to handle required clauses.

Or at least add a comment that this code is currently unused.

> BS2 will be faster in highly lopsided cases (AND of a rare term w/ a
> common one).

Iirc currently the default is to not use unordered scoring at all, and
there have been no more than a few complaints about slow performance
for which unordered scoring was faster.

Regards,
Paul Elschot

> 
> Mike
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org