I had to dig through the source code (actually, walk through a unit test,
because that was simpler to see what was going on in the 2.9 sorting), but
I think John's way has slightly lower complexity in the balanced segment
size case.
On Wed, Oct 14, 2009 at 8:57 PM, Yonik Seeley <yonik@lucidimagination.com> wrote:
> Interesting idea... though one further piece of info in the mix is
> that large segments are typically processed first, and tend to fill up
> the priority queue.
[noted: in the unbalanced case, things work out nicely for the 2.9 approach
compared to the balanced case, but M roughly equal balanced segments is a
common enough use case to consider it non-extreme (c.f. BalancedMergePolicy).
Of course, this also means that the sorting is very easy using John's
technique, since the number of string compares in the priority-queue merge
at the end is much lower.]
> Conversion from one segment to another is only
> done as needed... only the bottom slot is converted automatically when
> the segment is switched.
>
That's not what it looks like, actually: you convert the bottom slot, and
as soon as you get another hit from the new segment which is
competitive, it pushes what was the old bottom out, and there's a new
bottom, which often will be from one of the previous segments, in which
case *it* must be converted to the new ordering as well.
Additionally, whenever this happens, and a doc from the new segment
is competitive (which means that it wins in a fast int ordinal compare
with the bottom), it needs to do O(log(K)) slow *string* compares to find
out where in the queue this newly competitive doc belongs (where K is
numHits = size of the queue).
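To make that cost structure concrete, here is a loose, hypothetical model (none of these names are actual Lucene 2.9 classes; the counters are just for illustration): slots from the segment currently being collected compare by cheap int ordinal, while any compare involving a slot left over from an earlier segment has to fall back to a slow string compare (or trigger a conversion).

```java
public class SharedQueueSketch {
    // Hypothetical model of a single top-K queue shared across segments,
    // not the real Lucene 2.9 implementation. Counters track where the
    // comparison costs land.
    static long intCompares = 0, stringCompares = 0;

    static final class Slot {
        final int segment;   // which segment this hit came from
        final int ord;       // ordinal within that segment's term dictionary
        final String value;  // the resolved string value
        Slot(int segment, int ord, String value) {
            this.segment = segment; this.ord = ord; this.value = value;
        }
    }

    /** Compare two queue slots relative to the segment being collected.
     *  Same-segment ordinals are directly comparable ints; a slot from an
     *  earlier segment forces a string compare, since its ordinals live in
     *  a different term dictionary. */
    static int compare(Slot a, Slot b, int currentSegment) {
        if (a.segment == currentSegment && b.segment == currentSegment) {
            intCompares++;
            return Integer.compare(a.ord, b.ord);
        }
        stringCompares++;
        return a.value.compareTo(b.value);
    }
}
```

The O(log(K)) string compares above correspond to a newly competitive doc sifting past log(K) queue slots that may still belong to previous segments.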
> Filling a complete priority queue for each segment is also more
> expensive (from the bigO perspective).
Huh? If you have M roughly balanced segments of N docs each, and
you want to find the top K (out of all M*N) sorted docs, John is suggesting
(correct me if I'm wrong here, John) that you keep M priority queues of
size K, filling them in sequence: on average (and when K << N) this takes
M * N * c_i, where c_i is the cost of an integer compare, and then you
need to merge M sorted string lists of size K, which takes K log(M) * c_s,
where c_s is the cost of doing a string compare. This latter factor
is subleading in N, which is the biggest number around, and the
M * N * c_i factor is the same as the dominating factor in the current
2.9 sorting, because M * N * c_i is the minimal cost of comparing against
the bottom of the queue across all M * N docs.
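As I understand John's suggestion, it could be sketched like this (hypothetical code, not Lucene's actual API; segmentTopK and mergeTopK are names I made up): per-segment selection does only int-ordinal compares (the M * N * c_i term), and string compares are confined to the final K-way merge (the K log(M) * c_s term).

```java
import java.util.*;

public class MultiQueueMergeSketch {
    /** Top K values of one segment, selected with cheap int-ordinal
     *  compares only, returned in ascending order. ords[d] is the sort
     *  ordinal of doc d; values[o] is the string for ordinal o. */
    static List<String> segmentTopK(int[] ords, String[] values, int k) {
        // Max-heap on ordinals: the root is the worst of the current top K.
        PriorityQueue<Integer> pq =
            new PriorityQueue<>(k, Collections.reverseOrder());
        for (int ord : ords) {
            if (pq.size() < k) {
                pq.add(ord);
            } else if (ord < pq.peek()) {  // one cheap int compare per doc
                pq.poll();
                pq.add(ord);
            }
        }
        List<String> top = new ArrayList<>();
        while (!pq.isEmpty()) top.add(values[pq.poll()]);
        Collections.reverse(top);  // ascending by ordinal == ascending by value
        return top;
    }

    /** Merge M sorted per-segment top-K lists into the global top K.
     *  Only this step does string compares: roughly K * log(M) of them,
     *  via a heap of M list heads. Each head is {segmentIndex, position}. */
    static List<String> mergeTopK(List<List<String>> perSegment, int k) {
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> perSegment.get(a[0]).get(a[1])
                          .compareTo(perSegment.get(b[0]).get(b[1])));
        for (int m = 0; m < perSegment.size(); m++) {
            if (!perSegment.get(m).isEmpty()) heads.add(new int[] {m, 0});
        }
        List<String> result = new ArrayList<>();
        while (result.size() < k && !heads.isEmpty()) {
            int[] head = heads.poll();
            result.add(perSegment.get(head[0]).get(head[1]));
            if (head[1] + 1 < perSegment.get(head[0]).size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return result;
    }
}
```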
So the leading-order (in M*N) factors from both approaches are the same, and
anything without a factor of N is tiny (N might be 10^6, while M is maybe
10-20 and K is 10-100), but I worked out the subleading factors on a
sheet of paper and I think I've convinced myself that John's form
actually wins out even there, if you keep track of how often string
compares and conversions are required.
I can write up that O(K, M) analysis if you'd like, but it should be
mostly academic: for K, M << N, both John's approach and the
current 2.9 have the same leading behavior and should perform
equally well. After all, the multiple-queues approach is what
Lucene has always done for the MultiReader case, so why should
multiple SegmentReaders be any different?
jake
