lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
Date Wed, 04 Feb 2009 20:37:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670452#action_12670452 ]

Michael McCandless commented on LUCENE-1536:
--------------------------------------------

It's a ridiculous amount of data to digest, but here are some initial
observations/thoughts:

  * There are very sizable gains here by switching to random-access
    low (ie, handling top-level filter the way we now handle deletes).
    I'm especially interested in gains in the slowest queries.  EG the
    phrase query "united states" sees QPS gains from 16%-69%.  The
    10-clause OR query 1-10 sees QPS gains between 8% and 130%.

  * Results are consistent with LUCENE-1476: random-access low gives
    the best performance when filter density is >= 1%.

  * High is worse than trunk up until ~25% density, which makes sense
    since we are asking Scorer to do a lot of work producing docIDs
    that we then nix with the filter.

  * Low is consistently better than high, though as filter density
    gets higher the gap between them narrows.  I'll drop high from
    future tests.

  * The gains are generally strongest in the "moderate" density range,
    5-25%.

  * The degenerate 0% case is clearly far, far worse with random
    access, which is expected since the iterator scans the bits,
    finds none set, and quickly ends the search.  For very low
    density filters we should continue to use the iterator.

  * The "control" 100% case (where filter is null) is about the same,
    which is expected.
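
To make the two ways of applying a filter concrete, here is a minimal
sketch in plain Java.  It does not use Lucene's API: the class
FilterSketch, its method names, and the use of java.util.BitSet with a
sorted int[] standing in for the Scorer's docIDs are all assumptions made
for illustration, not the code in the attached patch.  The iterator style
leapfrogs between the scorer and the filter's next set bit, so an empty
filter ends the search at once; the random-access style tests each docID
the scorer produces against the bits.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; not Lucene code.
public class FilterSketch {

    // Iterator style: advance whichever side is behind until both agree.
    // If the filter has no set bits, f is -1 and the loop never runs.
    static List<Integer> viaIterator(int[] scorerDocs, BitSet filter) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int f = filter.nextSetBit(0);
        while (i < scorerDocs.length && f >= 0) {
            int d = scorerDocs[i];
            if (d == f) {
                hits.add(d);
                i++;
                f = filter.nextSetBit(f + 1);
            } else if (d < f) {
                i++;                        // scorer is behind the filter
            } else {
                f = filter.nextSetBit(d);   // filter is behind the scorer
            }
        }
        return hits;
    }

    // Random-access style: every docID the scorer produces is checked
    // against the filter bits, the way deleted docs are checked.
    static List<Integer> viaRandomAccess(int[] scorerDocs, BitSet filter) {
        List<Integer> hits = new ArrayList<>();
        for (int d : scorerDocs) {
            if (filter.get(d)) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] scorerDocs = {1, 3, 4, 7, 9};   // sorted docIDs from a query
        BitSet filter = new BitSet();
        filter.set(3);
        filter.set(4);
        filter.set(8);
        filter.set(9);
        System.out.println(viaIterator(scorerDocs, filter));
        System.out.println(viaRandomAccess(scorerDocs, filter));
    }
}
```

Both methods return the same hits; only the amount of work differs.  With
an empty filter the iterator version does no per-document work at all,
while the random-access version still visits every docID from the scorer,
which matches the 0% observation above.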



> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1536.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
>   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
>     10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
>   * I test across multiple queries.  1-X means an OR query, eg 1-4
>     means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
>     AND 3 AND 4.  "u s" means "united states" (phrase search).
>   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>     95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>     100 (filter=null, control)).
>   * Method high means I use random-access filter API in
>     IndexSearcher's main loop.  Method low means I use random-access
>     filter API down in SegmentTermDocs (just like deleted docs
>     today).
>   * Baseline (QPS) is current trunk, where filter is applied as iterator up
>     "high" (ie in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

