lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jake Mannix (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
Date Sun, 25 Oct 2009 20:31:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769860#action_12769860
] 

Jake Mannix commented on LUCENE-1997:
-------------------------------------

Mark, you say with the previous numbers, you'd say "-1", but if you look at the most common
use case (top 10), the simpler API is faster in almost all cases, and in some cases it's 10-20%
faster.   Top 500, top 1000 are not only just "not as common", they're probably at the 1%
level, or less.

As far as shifting back, API-wise, that really shouldn't be a factor: 2.9 *just* came out,
and what, we stick with a slightly *slower* API (for the most common use case across all Lucene
users), which happens to be *more complex*, and more importantly: just very nonstandard -
Comparable is very familiar to everyone, even if you have to have two forms, one for primitives,
one for Objects - an api which *doesn't* have the whole slew of compare(), compareBottom(),
copy(), setBottom(), value() and setNextReader() has a tremendous advantage over one which
does.  

It's "advanced" to implement a custom sort, but it will be *easier* if it's not complex, and
then it doesn't *need* to be "advanced" (shouldn't we be striving to make there be less APIs
which are listed as "advanced", and instead more features which can *do* complex things but
are still listed as things "normal users" can do).

I think it's *great* precedent to set with users to say, "oops!  we found that this new (just
now as of this version) api was unnecessarily clumsy, we're shifting back to a simpler one
which is just like the one you used to have".  Sticking with a worse api because it performs
better in only extreme scenarios because "we already moved on to this new api, shouldn't go
back now, don't want to admit we ever made a mistake!" is what is "ugly".

The main thing to remember is that the entire thinking around making this different from the
old was *only* because it seemed that using a simpler api would perform much worse than this
one, and it does not appear that this is the case.  If that original reasoning turns out to
have been incorrect, then the answer is simple: go with the simpler API *now* before users
*do* get used to using the new one.

If it turns out I'm wrong, and lots of users sort based on field values for the top 1000 entries
often, or that the most recent runs turn out to be flukes and are not typical performance,
only then would I'd change my opinion.

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
>                 Key: LUCENE-1997
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1997
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests.  Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available).  Then
> it runs various combinations:
>   * Index with 20 balanced segments vs index with the "normal" log
>     segment size
>   * Queries with different numbers of hits (only for wikipedia index)
>   * Different top N
>   * Different sorts (by title, for wikipedia, and by random string,
>     random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept.  The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message