lucene-dev mailing list archives

From mark harwood <>
Subject Re: ConjunctionScorer.doNext() overstays?
Date Thu, 01 Mar 2012 16:55:24 GMT
Fair points.
I've now tried indexes of several sizes and various blends of query term frequencies, and the
results swing only marginally between the two implementations.
Sometimes the "exiting early" logic is marginally faster and other times marginally slower.
Using a larger index seemed to reduce the improvement I had seen in my initial results.

So overall, not a clear improvement and not worth bothering with because, as you suggest,
various disk caching strategies probably mitigate the cost of the added reads.

Your comments on the added int comparison cost in that "hot" loop made me think that the
abstract DocIdSetIterator.docID() method call could be questioned on the same basis.
It looks like all DocIdSetIterator subclasses maintain a doc variable that is mutated elsewhere,
in advance() and nextDoc() calls, and docID() is meant to be idempotent, so presumably a shared
variable in the base class could avoid a docID() method invocation?
Anyhoo, the profiler did not show that method up as any sort of hotspot, so I don't think it's
an issue.
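
To illustrate the shared-variable idea above, here is a minimal sketch. The class names
SharedDocIterator and ArrayDocIterator are hypothetical and this is not Lucene's actual
DocIdSetIterator; only the docID()/nextDoc()/advance() method names and the NO_MORE_DOCS
sentinel mirror the real API. Whether this would save anything measurable is untested; as
noted, the profiler did not flag docID().

```java
// Hypothetical sketch: the current doc lives in the base class, so docID()
// can be a final field read instead of an abstract method call.
abstract class SharedDocIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Shared state; subclasses must update it in nextDoc() and advance().
    protected int doc = -1;

    // Idempotent accessor: calling it never moves the iterator.
    final int docID() {
        return doc;
    }

    abstract int nextDoc();

    abstract int advance(int target);
}

// Toy subclass backed by a sorted int array (an assumption made for the example).
final class ArrayDocIterator extends SharedDocIterator {
    private final int[] docs;
    private int idx = -1;

    ArrayDocIterator(int... sortedDocs) {
        this.docs = sortedDocs;
    }

    @Override
    int nextDoc() {
        idx++;
        doc = idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        return doc;
    }

    @Override
    int advance(int target) {
        // Linear scan for simplicity; a real postings reader would use skip data.
        int d;
        do {
            d = nextDoc();
        } while (d < target);
        return d;
    }
}
```

The catch with this design is that every subclass has to remember to keep the shared field
in sync, which the abstract-method version does not require.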

Thanks, Mike.

----- Original Message -----
From: Michael McCandless <>
To:; mark harwood <>
Sent: Thursday, 1 March 2012, 14:18
Subject: Re: ConjunctionScorer.doNext() overstays?

On Thu, Mar 1, 2012 at 8:49 AM, mark harwood <> wrote:
> I would have assumed the many int comparisons would cost less than the superfluous disk
> accesses? (I bow to your considerable experience in this area!)
> What is the worst-case scenario on added disk reads? Could it be as bad as numberOfSegments
> x numberOfOtherscorers before the query winds up?

Well, it depends -- the disk access is a one-time thing but the added
per-hit check is per-hit.  At some point it'll cross over...
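
The crossover point can be put into rough arithmetic. The sketch below is only an
illustration: CrossoverSketch and breakEvenChecks are hypothetical names, and both cost
parameters are placeholders to be filled in from real measurements, not figures from this
thread.

```java
// Hypothetical helper: how many per-hit checks add up to one one-time cost?
final class CrossoverSketch {
    private CrossoverSketch() {}

    /**
     * Returns the smallest number of per-hit checks whose combined cost is at
     * least the one-time cost (for example, a single extra disk read).
     * Both arguments are assumed to be measured in the same unit.
     */
    static long breakEvenChecks(long oneTimeCost, long perCheckCost) {
        if (oneTimeCost < 0 || perCheckCost <= 0) {
            throw new IllegalArgumentException("costs must be non-negative and perCheckCost > 0");
        }
        // Ceiling division without floating point.
        return (oneTimeCost + perCheckCost - 1) / perCheckCost;
    }
}
```

For a query that performs more checks than this, the added comparison costs more than the
one-time read it avoids; for a query that performs fewer, the early exit comes out ahead.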

I think likely the advance(NO_MORE_DOCS) will not usually hit disk:
our skipper impl fully pre-buffers (in RAM) the top skip lists I
think?  Even if we do go to disk it's likely the OS pre-cached those
bytes in its IO buffer.
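
For anyone following along, the two variants being compared can be sketched roughly as
below. This is a simplified stand-in and not the actual ConjunctionScorer code:
ConjunctionSketch, Postings and the exitEarly flag are hypothetical, and the postings are
plain in-memory arrays, so it only counts advance() calls and says nothing about disk or
skip-list behaviour.

```java
import java.util.ArrayList;
import java.util.List;

final class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical in-memory postings list that counts its advance() calls. */
    static final class Postings {
        private final int[] docs; // sorted ascending
        private int idx = -1;
        int doc = -1;
        int advanceCalls = 0;

        Postings(int... sortedDocs) {
            this.docs = sortedDocs;
        }

        int advance(int target) {
            advanceCalls++;
            do {
                idx++;
                doc = idx < docs.length ? docs[idx] : NO_MORE_DOCS;
            } while (doc < target);
            return doc;
        }
    }

    /** Collects every doc present in all sub-iterators. */
    static List<Integer> intersect(Postings[] subs, boolean exitEarly) {
        List<Integer> hits = new ArrayList<>();
        int target = 0;
        while (true) {
            int doc = align(subs, target, exitEarly);
            if (doc == NO_MORE_DOCS) {
                return hits;
            }
            hits.add(doc);
            target = doc + 1;
        }
    }

    /** Leapfrogs the sub-iterators until they all sit on the same doc. */
    private static int align(Postings[] subs, int target, boolean exitEarly) {
        int doc = subs[subs.length - 1].advance(target);
        int i = 0;
        while (subs[i].doc < doc) {
            // The extra comparison under discussion: stop as soon as any
            // sub-iterator is exhausted rather than advancing the others
            // to NO_MORE_DOCS as well.
            if (exitEarly && doc == NO_MORE_DOCS) {
                return NO_MORE_DOCS;
            }
            doc = subs[i].advance(doc);
            i = (i == subs.length - 1) ? 0 : i + 1;
        }
        return doc;
    }
}
```

Both settings return the same hits; the only difference is whether the remaining
sub-iterators receive a final advance(NO_MORE_DOCS) once one of them runs out.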

> On the index I tried, it looked like an improvement - the spreadsheet I linked to has
> the source for the benchmark on a second worksheet if you want to give it a whirl on a different

Maybe try it on a more balanced case?  Ie, N high-freq terms whose
freq is "close-ish"?  And on slow queries (I think the results in your
spreadsheet are very fast queries right?  The slowest one was ~0.95
msec per query, if I'm reading it right?).

In general I think not slowing down the worst-case queries is much
more important than speeding up the super-fast queries.


To unsubscribe, e-mail:
For additional commands, e-mail: