lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
Date Mon, 26 Oct 2009 10:38:59 GMT


Michael McCandless commented on LUCENE-1997:

Looking more at this, I think there are a few problems with the
current test:

  * The "random" index is smallish -- it has only 7 segments.  I'd
    like to test across a wider range of segments.

  * My "optimized" code for the multi-PQ (old) API is too optimized --
    I conflated the comparator directly into the PQ (eg I have
    SortByIntQueue, SortByStringQueue, that directly subclass
    DocIDPriorityQueue and override only compare).  For source code
    specialization this would be appropriate (it is "correct"), but in
    order to test the two extension APIs, it's not.  I'll rework it to
    pull it back out into a separate comparator.

  * We should test more realistic queries & index.  Of all tested so
    far, query "1" on wiki index is the most realistic, and we see the
    least gains there.  The *:* query is unnatural, in part because
    the I think first N title in wiki'd index appear to be roughly
    alpha sorted.  Random strings & random ints are very realistic.

  * We need more results from diverse envs -- eg Windows (different
    versions), different Linux's, JREs, etc.

Also, I really do not like the large perf hits at highish topN sizes.
Admittedly it's unusual (though I suspect not THAT rare) to use such a
large topN, but, I esepically don't like that it doesn't degrade
gracefully -- it's surprising.  Likewise I'd like to see with highish
number of segments how things degrade.

I would also give more weight to the JRE 1.6, 64 bit, results.  Yes we
can debate what's the popular JRE in production today, but this API is
going to last a looong time in Lucene so I think we should bias
towards what will be the future JRE.  Maybe we should even test on JDK
7 preview... though it's probably too early to make any decisisions
based in its performance.

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>                 Key: LUCENE-1997
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests.  Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available).  Then
> it runs various combinations:
>   * Index with 20 balanced segments vs index with the "normal" log
>     segment size
>   * Queries with different numbers of hits (only for wikipedia index)
>   * Different top N
>   * Different sorts (by title, for wikipedia, and by random string,
>     random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept.  The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message