lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: lucene 2.9 sorting algorithm
Date Thu, 15 Oct 2009 21:12:12 GMT
Nice results!  Comments below...

On Thu, Oct 15, 2009 at 3:58 PM, John Wang <john.wang@gmail.com> wrote:
> Hi guys:
>
>     I did some Big O math a few times and reached the same conclusion Jake
> had.
>
>     I was not sure about the code tuning opportunities we could have done
> with the MergeAtTheEnd method as Yonik mentioned and the internal behavior
> with PQ Mike suggested, so I went ahead and implemented the algorithm and
> compared against the current Lucene 2.9 implementation:
>
> set up:
> index size: 999992 docs
> segment count: 8
> sorting on 1 field, no score, e.g. comparing against
> "OneComparatorNonScoringCollector" class.
> TermQuery used on field with only 2 values (even/odd), returning 499992
> docs.

What text field are you sorting on?  (Ie where did it come from -- is
it a natural or synthetic source?).

> I broke the tests into two parts, numHits = 20 and numHits = 50, for each, I
> ran 100 iterations, took the time from 6 - 100 (to avoid field cache load
> time) and take the average. Did it 3 times and took that average.

Can you try numHits=10 and 20?  I'm curious how this difference scales
down.  I believe most searches don't go beyond page 1.

It'd be great to see sorting by other typed (simple numerics) fields,
and queries matching different # results, too.

> Here are the numbers (times are measured in nanoseconds):
>
> numHits = 50:
>
> Lucene 2.9/OneComparatorNonScoringCollector:
> num string compares: 251
> num conversions: 34
> time: 15272049
> cpu time:  161105
> num inserts into PQ: 211
>
> My test sort algorithm:
> num string compares: 51
> num conversions: 0
> time: 14943172
> cpu time: 153722
> num inserts into PQ: 1500
>
>
> numHits = 100:
>
>
> Lucene 2.9/OneComparatorNonScoringCollector:
> num string compares: 2285
> num conversions: 172
> time: 17703866
> cpu time:  187073
> num inserts into PQ: 982
>
> My test sort algorithm:
> num string compares: 210
> num conversions: 0
> time: 15823473
> cpu time: 160737
> num inserts into PQ: 6011

These are nice results.  Single PQ does looks faster for this case.

> Conclusion:
>
> My measurement is consistent with mine and Jake's Big O analysis, where the
> dominating factor is the same between both algorithms. And in terms of
> timing/cpu, they are very similar with the second algorithm consistently
> perform a slightly better than Lucene 2.9.
> Mike is absolutely correct about the number of PQ insertions, but that win
> seems to be shadowed by the cost of extra string compares and conversions.
> If you look at the "growth" between numHits=20 and 100, the time gap
> increases, e.g. as the PQ gets larger (case where user navigates to a higher
> page of the search result) the num string compares increase faster than the
> increase of the size of PQ.
> There is an assumption that the conversion cost is low, which may not be
> true for custom sort extensions, and could easily nominate the entire query
> time. (more analysis below)
>
> Given this analysis, seems like at the least the second algorithm is not
> worse in terms of performance, and I would argue it is actually better.

I agree, so far!

> With the current API which ties into this algorithm in Lucene 2.9
> (FieldComparator), there is actually a subtle behavior change, such that, to
> implement a custom sort, user would have to be able to do this conversion,
> which involves a reverse mapping from value to ordinals. This is a change on
> the contract for writing custom sort comparison, where before,
> ordinal->value mapping can be 1 to many, and now, with the requirement for
> this reverse mapping, it is much more difficult and expensive to do.

But a custom comparator need not do the reverse conversion when
implementing FieldComparator?  Ie, it can implement the conversion
however it wants?  Also one can always implement their own Collector
to do the sorting.

> If you agree, I think it is worthwhile to open a jira ticket for this while
> this FieldComparator API is still fresh. And I'd be happy to provide my
> implementation, test code and my index.

+1

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message