lucene-dev mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: ConjunctionScorer.doNext() overstays?
Date Thu, 01 Mar 2012 13:49:55 GMT
I would have assumed the many int comparisons would cost less than the superfluous disk accesses?
(I bow to your considerable experience in this area!)
What is the worst-case scenario on added disk reads? Could it be as bad as numberOfSegments
x numberOfOtherscorers before the query winds up?
On the index I tried, it looked like an improvement - the spreadsheet I linked to has the
source for the benchmark on a second worksheet if you want to give it a whirl on a different
dataset.
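
For anyone who wants to reason about the trade-off without a real index, here is a minimal, self-contained sketch. It is not the Lucene source: ArrayIterator, run() and the earlyExit flag are hypothetical names invented for illustration, the postings are made-up int arrays, and doNext() is only modelled loosely on the loop quoted further down the thread. It counts advance() calls with and without the extra (doc != NO_MORE_DOCS) check.

```java
public class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a sub-scorer: walks a sorted int[] of doc IDs. */
    static final class ArrayIterator {
        private final int[] docs;
        private int pos = -1;      // -1 means "not positioned yet"
        int advanceCalls = 0;      // stands in for potential disk reads

        ArrayIterator(int[] docs) { this.docs = docs; }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        /** Moves to the first doc >= target, or NO_MORE_DOCS if exhausted. */
        int advance(int target) {
            advanceCalls++;
            if (pos < 0) pos = 0;
            while (pos < docs.length && docs[pos] < target) pos++;
            return docID();
        }
    }

    /** Leapfrog loop; earlyExit adds the proposed (doc != NO_MORE_DOCS) guard. */
    static int doNext(ArrayIterator[] scorers, boolean earlyExit) {
        int first = 0;
        int doc = scorers[scorers.length - 1].docID();
        while ((!earlyExit || doc != NO_MORE_DOCS) && scorers[first].docID() < doc) {
            doc = scorers[first].advance(doc);
            first = (first == scorers.length - 1) ? 0 : first + 1;
        }
        return doc;
    }

    /** Runs the conjunction to exhaustion; returns {hits, total advance() calls}. */
    static int[] run(boolean earlyExit, int[][] postings) {
        ArrayIterator[] scorers = new ArrayIterator[postings.length];
        int target = 0;
        for (int i = 0; i < postings.length; i++) {
            // Chained positioning leaves the scorers in ascending docID order,
            // which the leapfrog loop relies on.
            scorers[i] = new ArrayIterator(postings[i]);
            target = scorers[i].advance(target);
        }
        ArrayIterator last = scorers[scorers.length - 1];
        int hits = 0;
        int doc = doNext(scorers, earlyExit);
        while (doc != NO_MORE_DOCS) {
            hits++;
            last.advance(doc + 1);   // step past the current match
            doc = doNext(scorers, earlyExit);
        }
        int calls = 0;
        for (ArrayIterator s : scorers) calls += s.advanceCalls;
        return new int[] {hits, calls};
    }

    public static void main(String[] args) {
        int[][] postings = { {1, 3, 5, 7, 9}, {3, 9, 12}, {3, 4, 9} };
        int[] plain = run(false, postings);
        int[] early = run(true, postings);
        System.out.println("without check: hits=" + plain[0] + " advance calls=" + plain[1]);
        System.out.println("with check:    hits=" + early[0] + " advance calls=" + early[1]);
    }
}
```

In this toy model both variants find the same hits, and the guard only saves the trailing advance(NO_MORE_DOCS) calls on the other sub-scorers once one of them is exhausted. It says nothing about the real cost of the per-hit comparison versus a disk read; that still needs a benchmark on an actual index.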



----- Original Message -----
From: Michael McCandless <lucene@mikemccandless.com>
To: dev@lucene.apache.org; mark harwood <markharw00d@yahoo.co.uk>
Cc: 
Sent: Thursday, 1 March 2012, 13:31
Subject: Re: ConjunctionScorer.doNext() overstays?

Hmm, the tradeoff is an added per-hit check (doc != NO_MORE_DOCS), vs
the one-time cost at the end of calling advance(NO_MORE_DOCS) for each
sub-clause?  I think in general this isn't a good tradeoff?

I.e., what about the case where we AND together high-freq terms of
similar frequency?  Then the per-hit check will at some point dominate?

It's valid to pass NO_MORE_DOCS to DocsEnum.advance.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Mar 1, 2012 at 7:22 AM, mark harwood <markharw00d@yahoo.co.uk> wrote:
> I got round to some benchmarking of this change on Wikipedia content which shows a small
> improvement: http://goo.gl/60wJG
>
> Aside from the small performance gain to be had, it just feels more logical if ConjunctionScorer
> does not issue sub scorers with a request to advance to "NO_MORE_DOCS".
>
>
>
>
> ----- Original Message -----
> From: mark harwood <markharw00d@yahoo.co.uk>
> To: "dev@lucene.apache.org" <dev@lucene.apache.org>
> Cc:
> Sent: Thursday, 1 March 2012, 9:39
> Subject: ConjunctionScorer.doNext() overstays?
>
> Due to the odd behaviour of a custom Scorer of mine, I discovered that ConjunctionScorer.doNext()
> could loop indefinitely.
> It does not bail out as soon as any scorer.advance() call it makes reports back "NO_MORE_DOCS".
> Is there not a performance optimisation to be gained in exiting as soon as this happens?
> At this stage I cannot see any point in continuing to advance other scorers - a quick
> look at TermScorer suggests that any questionable calls made by ConjunctionScorer to advance
> to NO_MORE_DOCS receive no special treatment, and disk will be hit as a consequence.
> I added an extra condition to the while loop on the 3.5 source:
>
>     while ((doc != NO_MORE_DOCS) && ((firstScorer = scorers[first]).docID() < doc)) {
>
> and the JUnit tests passed. I haven't been able to benchmark performance improvements, but
> it looks like it would be sensible to make the change anyway.
>
> Cheers,
> Mark
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org