lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
Date Fri, 01 May 2009 15:00:31 GMT


Michael McCandless commented on LUCENE-1536:

I don't think it should be the caller's job to call
getSequentialSubScorers and push down a RAF.  Rather, I think when
requesting a scorer we should pass in a RAF, requiring that the
returned scorer factor it in (passing it on to its own sub-scorers if
needed).
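A minimal sketch of what that could look like (names like RandomAccessFilter and FilteredScorer are illustrative here, not Lucene's actual API): the filter is handed to the scorer at construction, and the scorer consults it itself rather than relying on the caller to push it into each sub-scorer.

```java
import java.util.BitSet;

// Hypothetical random-access filter interface a Weight could receive
// when its scorer is requested.
interface RandomAccessFilter {
    boolean accept(int docID);
}

// Concrete impl backed by a bit set.
class BitSetFilter implements RandomAccessFilter {
    private final BitSet bits;
    BitSetFilter(BitSet bits) { this.bits = bits; }
    public boolean accept(int docID) { return bits.get(docID); }
}

// Toy scorer that factors the filter in itself while iterating its
// (pre-sorted) matching docIDs.
class FilteredScorer {
    private final int[] docs;              // matching docIDs, in order
    private final RandomAccessFilter filter;
    private int pos = -1;

    FilteredScorer(int[] docs, RandomAccessFilter filter) {
        this.docs = docs;
        this.filter = filter;
    }

    /** Returns the next accepted docID, or -1 when exhausted. */
    int nextDoc() {
        while (++pos < docs.length) {
            if (filter == null || filter.accept(docs[pos])) {
                return docs[pos];
            }
        }
        return -1;
    }
}
```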

bq. One thing we could do is require a concrete impl (eg OpenBitSet) in order to use this

This may be too restrictive, because another use case (touched on
already in LUCENE-1593, but actually much more similar to this issue)
is when sorting by field.

E.g. say we are sorting by int ascending, and from the
FieldValueHitQueue we know the bottom value is 17.  Then, when
TermScorer sees a docID, it should check whether that doc's value is
greater than 17 and skip the doc if so.
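A tiny sketch of that check, under the assumptions above (the class and field names here are made up for illustration): per-doc field values are available via random access, and any doc whose value exceeds the current queue bottom can never enter the queue.

```java
// Hypothetical reject-style check for sort-by-int-ascending: docs
// whose field value exceeds the queue's current bottom cannot compete,
// so the scorer could skip them outright.
class BottomValueSkipper {
    private final int[] fieldValues; // fieldValues[docID] = sort value
    private final int bottom;        // worst value currently in queue

    BottomValueSkipper(int[] fieldValues, int bottom) {
        this.fieldValues = fieldValues;
        this.bottom = bottom;
    }

    boolean reject(int docID) {
        return fieldValues[docID] > bottom;
    }
}
```

Note this is exactly why precise totalHits is lost: rejected docs are never counted, only never collected.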

Very likely this will be a sizable performance gain, but it would be a
major change because you can only do this if you do not need the
precise totalHits back.

So... maybe we need to allow an abstract RandomAccessDocIdSet, to
allow this use case.  But perhaps we should negate its API?  Ie it
exposes "boolean reject(int docID)".
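That negated API might look roughly like this (a sketch, not a committed design): an abstract class exposing only reject(docID), so both concrete bit sets and computed predicates (like the sort-by-field case above) fit behind one interface.

```java
import java.util.BitSet;

// Sketch of the proposed negated API: subclasses say which docs to
// reject, whether backed by stored bits or computed on the fly.
abstract class RandomAccessDocIdSet {
    abstract boolean reject(int docID);
}

// Bit-set-backed impl: reject anything not in the accepted set.
class BitSetRandomAccessDocIdSet extends RandomAccessDocIdSet {
    private final BitSet accepted;
    BitSetRandomAccessDocIdSet(BitSet accepted) { this.accepted = accepted; }
    boolean reject(int docID) { return !accepted.get(docID); }
}
```

The negation matters because a computed rejector (e.g. "value > bottom") has no finite set of accepted docs to enumerate, while a bit set can trivially answer either question.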

> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>                 Key: LUCENE-1536
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
>   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
>     10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
>   * I test across multiple queries.  1-X means an OR query, eg 1-4
>     means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
>     AND 3 AND 4.  "u s" means "united states" (phrase search).
>   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>     95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>     100 (filter=null, control)).
>   * Method high means I use random-access filter API in
>     IndexSearcher's main loop.  Method low means I use random-access
>     filter API down in SegmentTermDocs (just like deleted docs
>     today).
>   * Baseline (QPS) is current trunk, where filter is applied as iterator up
>     "high" (ie in IndexSearcher's search loop).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

