lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: lucene 2.9 sorting algorithm
Date Thu, 15 Oct 2009 21:33:38 GMT
On Thu, Oct 15, 2009 at 3:52 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
>
> On Thu, Oct 15, 2009 at 3:12 AM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>>
>> If I'm remembering it right... this (matching MultiSearcher's approach)
>> was nearly the first thing we tried with LUCENE-1483.  But the CPU
>> cost was higher in our tests.  I think we had tested unbalanced and
>> balanced segments, but memory is definitely somewhat hazy at this
>> point...
>
> Yeah, from what I could tell, the way you used the PQ api, comparing
> only with the bottom until you saw it beat the bottom, instead of always
> just addWithOverflow, leads to a significant performance improvement,
> but this can be done with the MultiSearcher / multiple PQ approach as
> well - like I said, I'm pretty sure the overall complexity is the same
> (ie. leading complexity, with the factor of numDocs = N * M out front), and
> we're just talking about sub-leading behavior at this point.

I agree.
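(For concreteness, the "compare with the bottom first" trick being discussed looks roughly like this toy sketch -- plain Java, not Lucene's actual HitQueue API, all names made up: the queue only pays the re-heapify cost when a candidate actually beats the current bottom.)

```java
// Toy min-heap of the K largest scores seen; heap[1] is the "bottom"
// (the worst hit currently in the top K).
class BottomCheckedQueue {
    private final int[] heap;   // 1-based binary min-heap
    private int size;

    BottomCheckedQueue(int capacity) { heap = new int[capacity + 1]; }

    /** Offer a candidate; only touch the heap when it can make the top K. */
    void offer(int score) {
        if (size < heap.length - 1) {      // queue not yet full: plain insert
            heap[++size] = score;
            upHeap(size);
        } else if (score > heap[1]) {      // cheap bottom check first
            heap[1] = score;               // replace the bottom...
            downHeap(1);                   // ...and re-heapify (the costly part)
        }                                  // else: no heap work at all
    }

    int bottom() { return heap[1]; }

    private void upHeap(int i) {
        for (int p = i >>> 1; p > 0 && heap[i] < heap[p]; i = p, p >>>= 1)
            swap(i, p);
    }

    private void downHeap(int i) {
        while (true) {
            int c = i << 1;
            if (c > size) return;
            if (c + 1 <= size && heap[c + 1] < heap[c]) c++;
            if (heap[c] >= heap[i]) return;
            swap(i, c);
            i = c;
        }
    }

    private void swap(int a, int b) { int t = heap[a]; heap[a] = heap[b]; heap[b] = t; }
}
```

The point being: the `score > heap[1]` check is a single predictable branch, while `downHeap` is the chain of hard-to-predict ifs Mike mentions below.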

>> I suspect even in the balanced case the cost will be higher with the
>> "separate PQs then merge at the end" approach.  It's not only the int
>> ord comparisons (which are cheap) that take CPU.  It's the if
>> statements that are executed as a result of the comparison when
>> re-heapifying the PQ.  Those are far more costly especially if the CPU
>> has trouble predicting which way the if will go, which it does here.
>>
>> Ie, it's the insertions into each PQ that are expensive; by tracking a
>> separate PQ per reader you'll have many more insertions since
>> each PQ must re-converge.
>
> Insertions are expensive - totally agreed, and subtle branching tweaks
> make a big difference at subleading level.  So we need to do a careful
> analysis of the sub-leading behavior (O(K,M) but O(1) in terms of N):
>
> The multi-PQ approach requires more insertions: O(M * K log(K))
> insertions, but each one can be made very fast, with careful branch
> tuning similar to the tweaks currently in the 2.9 code - as all of the
> comparisons are ints.  The final merging requires K log(M) string
> compares as well.

I don't think we do any branch tuning on the PQ insertion -- the ifs
involved in re-heapifying the PQ are simply hard for the CPU to
predict (though, apparently, not as hard as comparing strings ;).
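(As a toy illustration of the final K log(M) merge Jake mentions -- again not Lucene code, names made up: M per-segment result lists, each already sorted, merged with a heap of size M, so each of the K emitted hits costs O(log M) compares on the sort value.)

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class MultiPqMerge {
    /** Merge M sorted per-segment result lists, keeping the first k overall. */
    static List<String> mergeTopK(List<List<String>> perSegment, int k) {
        // Heap of "heads", one per segment: {segment index, position},
        // ordered by the sort value at that position. Size stays <= M.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            Comparator.comparing((int[] h) -> perSegment.get(h[0]).get(h[1])));
        for (int s = 0; s < perSegment.size(); s++)
            if (!perSegment.get(s).isEmpty()) heads.add(new int[] {s, 0});

        List<String> merged = new ArrayList<>(k);
        while (merged.size() < k && !heads.isEmpty()) {
            int[] h = heads.poll();                      // O(log M) compares
            merged.add(perSegment.get(h[0]).get(h[1]));
            if (h[1] + 1 < perSegment.get(h[0]).size())  // advance within segment
                heads.add(new int[] {h[0], h[1] + 1});
        }
        return merged;
    }
}
```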

> The single-PQ approach requires only O(K * log(K) * log(M))
> insertions, but each one requires, instead of O(log(K)) int compares,
> O(log(K)) *string* compares, and a segment conversion.

Right, except for a large segment: once it has had enough insertions
that more entries in the PQ belong to that segment, more of the
comparisons become cheap int (ord) compares.
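(A tiny sketch of why, with made-up names rather than Lucene's FieldComparator API: ords are assigned per segment, so an ord-to-ord comparison is only valid when both queue entries came from the same segment; otherwise you fall back to comparing the term values themselves.)

```java
class SegmentOrdCompare {
    /**
     * Compare two queue entries by sort value. Ords are only comparable
     * within one segment; across segments we must compare the strings.
     */
    static int compare(int segA, int ordA, String valA,
                       int segB, int ordB, String valB) {
        if (segA == segB) {
            return Integer.compare(ordA, ordB); // cheap int compare
        }
        return valA.compareTo(valB);            // slow string compare
    }
}
```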

> Plugging in some numbers, with K = 100, and M = 20
>
>   multiPQ : 14,000 insertions which have 7 int compares each,
>                   followed by 400 string compares for the merge
> vs
>
>   singlePQ: 2,800 insertions which have maybe 4 string compares
>                    and 3 int compares each
>
> ---------
>
> So yeah, executive summary in theory: *at sub-leading order*, Yonik
> is right, singlePQ has fewer overall operations (log(M) vs M), but it
> has way more slow string compares (K log(K)^2 log(M)  vs K log(M) ).
>
>>
>> But I agree it's not obvious which way it'll go... so why not simply
>> build a comparator that does this and then let the computer tell us?
>
> I agree completely - this is all talk until I see it run on an actual index.
> I know John was writing up some code to port bobo to 2.9, so perhaps
> you can try and compare these two, with a couple different values of K,
> John?
>
> I guess a bigger question than just the sub-leading behavior (if we're
> talking 3.5 ms vs. 3.1 ms, this is really academic) is this: if the overall
> performance is basically the same (which I'm guessing is the case),
> then for developers, writing and understanding Comparators for the
> old API is *way* easier than with the current one, and the multi-PQ
> technique makes it totally easy to use the same logic and understanding
> that already exists for MultiSearcher.
>
> Of course, if my calculations above are wrong, or the sub-leading
> behavior is dominated by the branching and string compares are super
> fast, then difficulty of API be damned, we stick with what performs
> best.

I agree, if the gains are minor (or non-existent!) across varying
sort field types, number of hits, number of segments, etc., then the
old API is MUCH easier to grok than the new one, and we should switch
back.  API simplicity is important.
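For what it's worth, Jake's plug-in numbers above (K = 100, M = 20) check
out if you use rounded base-2 logs, as he did -- a quick sketch:

```java
class PqCostEstimate {
    // Rounded base-2 log, matching the back-of-the-envelope numbers above.
    static long log2(int n) {
        return Math.round(Math.log(n) / Math.log(2)); // log2(100) -> 7, log2(20) -> 4
    }

    // multi-PQ: O(M * K log K) int-compare work during collection...
    static long multiPqInsertWork(int k, int m) { return m * k * log2(k); }

    // ...plus O(K log M) string compares in the final merge.
    static long multiPqMergeCompares(int k, int m) { return k * log2(m); }

    // single PQ: O(K log K log M) insertions, each with string compares.
    static long singlePqInsertWork(int k, int m) { return k * log2(k) * log2(m); }
}
```

With K = 100 and M = 20 this gives 14,000, 400, and 2,800 respectively.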

Looking back through LUCENE-1483, we were seeing much more sizable
gains in the new API vs the old single-PQ-only-int-comparison approach
(which should be faster than both the single and multi PQ approaches
we are discussing here).  BUT: those tests combined searching by
segment AND the new FieldCollector API, plus other more basic code
speedups.  Though it'd be odd if the switch to searching by segment
really was most of the gains here.  I'm confused (for now) about where
the gains we were seeing in that issue came from...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

