lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: lucene 2.9 sorting algorithm
Date Thu, 15 Oct 2009 10:12:10 GMT
If I remembering it right... this (matching MultiSearcher's approach)
was nearly the first thing we tried with LUCENE-1483.  But the CPU
cost was higher in our tests.  I think we had tested unbalanced and
balanced segments, but memory is definitely somewhat hazy at this
point...

I suspect even in the balanced case the cost will be higher with the
"separate PQs then merge at the end" approach.  It's not only the int
ord comparisons (which are cheap) that take CPU.  It's the if
statements that are executed as a result of the comparison when
re-heapifying the PQ.  Those are far more costly especially if the CPU
has trouble predicting which way the if will go, which it does here.

Ie, it's the insertions into each PQ that are expensive; by tracking a
separate PQ per reader you'll have many more insertions since
each PQ must re-converge.

But I agree it's not obvious which way it'll go... so why not simply
build a comparator that does this and then let the computer tell us?

We played with a number of String comparators in LUCENE-1483; it could
also be that one of those might fit better for the balanced case,
too.

Mike

On Thu, Oct 15, 2009 at 4:31 AM, Jake Mannix <jake.mannix@gmail.com> wrote:
> I had to dig through the source code (actually, walk through a unit test,
> because
> that was simpler to see what was going on in the 2.9 sorting), but I think
> John's
> way has slightly lower complexity in the balanced segment size case.
>
> On Wed, Oct 14, 2009 at 8:57 PM, Yonik Seeley <yonik@lucidimagination.com>
> wrote:
>>
>> Interesting idea... though one further piece of info in the mix is
>> that large segments are typically processed first, and tend to fill up
>> the priority queue.
>
> [noted: in the unbalanced case, things work out nicely for the 2.9 case
> compared
> to the balanced case, but M roughly equal balanced segments is common enough
> use case to consider this as non-extreme (c.f. BalancedMergePolicy).
> Of course this means that the sorting is very easy using John's technique as
> well,
> as the number of string compares in the priority queue merge at the end is
> much lower]
>
>>
>> Conversion from one segment to another is only
>> done as needed... only the bottom slot is converted automatically when
>> the segment is switched.
>
> That's not what it looks like, actually: you convert the bottom slot, and
> as soon as you get another hit from the new segment which is
> competitive, it pushes what was the old bottom out, and there's a new
> bottom, which often will be from one of the previous segments, in which
> case *it* must be converted to the new ordering as well.
>
> Additionally, whenever this happens, and a doc from the new segment
> is competitive (which means that it wins in a fast int ordinal compare
> with the bottom), it needs to do O(log(K)) slow *string* compares to find
> out where in the queue this newly competitive doc belongs (where K is
> numHits = size of the queue).
>
>>
>> Filling a complete priority queue for each segment is also more
>> expensive (from the big-O perspective).
>
> Huh?  If you have M roughly balanced segments of N docs each, and
> you want to find the top K (out of all M*N) sorted docs, John is suggesting
> (correct me if I'm wrong here John), that you keep M priority queues of
> size K, filling them in sequence: on average (and when K << N) this takes
> M * N * c_i, where c_i is the cost of an integer compare, and then you
> need to merge M sorted string lists of size K, which takes K log(M) * c_s,
> where c_s is the cost of doing a string compare.  This latter factor
> is sub-leading in N, which is the biggest number around, and the
> M * N * c_i factor is the same as the dominating factor in the current
> 2.9 sorting, because M * N * c_i is minimal cost to compare with the
> bottom of the queue across all M * N docs.
>
> So the leading in M*N factors from both approaches are the same, and
> anything without a factor of N is tiny (if N might be 10^6, M is maybe
> 10-20, K is 10-100), but I worked out the sub-leading factors on a
> sheet of paper and I think I've convinced myself that John's form
> actually wins out even there, if you keep track of how often string
> compares and conversions are required.
>
> I can write up that O(K,M) analysis if you'd like, but it should be
> mostly academic - for K, M << N, both John's approach and the
> current 2.9 have the same leading behavior and should perform
> equally well.  After all - the multiple queues approach is what
> lucene has always done for the multi-reader case, why is it that
> multiple SegmentReaders should be any different?
>
>   -jake
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message