Hi guys:
I did some Big O math a few times and reached the same conclusion Jake
had.
I was not sure about the code tuning opportunities we could have done
with the MergeAtTheEnd method as Yonik mentioned and the internal behavior
with PQ Mike suggested, so I went ahead and implemented the algorithm and
compared against the current Lucene 2.9 implementation:
set up:
index size: 999992 docs
segment count: 8
sorting on 1 field, no score, e.g. comparing against
"OneComparatorNonScoringCollector" class.
TermQuery used on field with only 2 values (even/odd), returning 499992
docs.
I broke the tests into two parts, numHits = 20 and numHits = 50, for each, I
ran 100 iterations, took the time from 6  100 (to avoid field cache load
time) and take the average. Did it 3 times and took that average.
Here are the numbers (times are measured in nanoseconds):
numHits = 50:
Lucene 2.9/OneComparatorNonScoringCollector:
num string compares: 251
num conversions: 34
time: 15272049
cpu time: 161105
num inserts into PQ: 211
My test sort algorithm:
num string compares: 51
num conversions: 0
time: 14943172
cpu time: 153722
num inserts into PQ: 1500
numHits = 100:
Lucene 2.9/OneComparatorNonScoringCollector:
num string compares: 2285
num conversions: 172
time: 17703866
cpu time: 187073
num inserts into PQ: 982
My test sort algorithm:
num string compares: 210
num conversions: 0
time: 15823473
cpu time: 160737
num inserts into PQ: 6011
Conclusion:
My measurement is consistent with mine and Jake's Big O analysis, where the
dominating factor is the same between both algorithms. And in terms of
timing/cpu, they are very similar with the second algorithm consistently
perform a slightly better than Lucene 2.9.
Mike is absolutely correct about the number of PQ insertions, but that win
seems to be shadowed by the cost of extra string compares and conversions.
If you look at the "growth" between numHits=20 and 100, the time gap
increases, e.g. as the PQ gets larger (case where user navigates to a higher
page of the search result) the num string compares increase faster than the
increase of the size of PQ.
There is an assumption that the conversion cost is low, which may not be
true for custom sort extensions, and could easily nominate the entire query
time. (more analysis below)
Given this analysis, seems like at the least the second algorithm is not
worse in terms of performance, and I would argue it is actually better.
With the current API which ties into this algorithm in Lucene 2.9
(FieldComparator), there is actually a subtle behavior change, such that, to
implement a custom sort, user would have to be able to do this conversion,
which involves a reverse mapping from value to ordinals. This is a change on
the contract for writing custom sort comparison, where before,
ordinal>value mapping can be 1 to many, and now, with the requirement for
this reverse mapping, it is much more difficult and expensive to do.
If you agree, I think it is worthwhile to open a jira ticket for this while
this FieldComparator API is still fresh. And I'd be happy to provide my
implementation, test code and my index.
Thanks
John
On Thu, Oct 15, 2009 at 10:06 AM, Jake Mannix <jake.mannix@gmail.com> wrote:
> On Thu, Oct 15, 2009 at 9:12 AM, Yonik Seeley <yonik@lucidimagination.com>wrote:
>
>> On Thu, Oct 15, 2009 at 11:53 AM, Yonik Seeley
>> <yonik@lucidimagination.com> wrote:
>> > And it seems like a PQ per segment simply delays many of the slow
>> > lookups until the end where the PQs must be merged.
>>
>> Actually, I'm wrong about that part  one can simply merge on
>> values... there will be lots of string comparisons (and a new merge
>> queue) but no binary searches.
>>
>
> The number of string comparisons in the final merge of queues is
> O(K log(M) ) if you do as you're describing (have basically a PQ of sorted
> lists, the same way BooleanScorer works), where K is going to be roughly
> what, 10, 100 (numResultsToReturn), and M is 1050? This is like a couple
> thousand string compares, tops, and independent of the overall size of the
> segments  it's measured in microseconds of CPU time.
>
> jake
>
