lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
Date Sat, 06 Nov 2010 10:30:43 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928948#action_12928948 ]

Michael McCandless commented on LUCENE-1536:
--------------------------------------------

bq. Wondering what your thoughts are on fixing filters correctly?

I think the approach you outlined is the right one!

We already have the APIs in flex (Bits interface for random access, postings APIs take a Bits
skipDocs); in backporting to 3.x I think we'd just port Bits back.

There are some challenges though:

  * We should add a method to Filter to ask it if it's already folded in deleted docs or not.
So eg if a Filter is random access but doesn't factor in del docs we'd have to wrap it so
that every random access check also checks del docs ("AND NOT deleted.get(docID)").

  * We need a coarse heuristic in IndexSearcher to decide when a filter "merits" down low
application.  Ie, even if a filter is random access, if it's rather sparse (< 1% or 2%
or something) it's better to apply it the way we do today ("up high").  In the current patch
the switch is too coarse (it's either globally on or off); it should be based on the filter instead,
or maybe the filter provides a method and that method defaults to the 1% or 2% threshold check.

  * I suspect we should invert the "Bits skipDocs" now passed to the flex APIs, to be "Bits
acceptDocs" instead, so that we don't have to invert every filter.  This'd also mean changing
IndexReader.getDeletedDocs to IndexReader.getNotDeleteDocs.

Then I think we simply pass the Bits filter into the Weight.scorer API.
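
A minimal sketch of the wrapping and inversion described above, not code from the patch. The Bits interface below is a local stand-in with the same two methods as the flex one (get and length); FilterAndNotDeleted, NotBits and BitSetBits are hypothetical names invented for illustration.

```java
import java.util.BitSet;

// Local stand-in for the flex Bits interface: random access by docID.
interface Bits {
    boolean get(int index);
    int length();
}

// Hypothetical helper: exposes a java.util.BitSet through Bits.
final class BitSetBits implements Bits {
    private final BitSet bits;
    private final int length;

    BitSetBits(BitSet bits, int length) {
        this.bits = bits;
        this.length = length;
    }

    @Override public boolean get(int index) { return bits.get(index); }
    @Override public int length() { return length; }
}

// Hypothetical wrapper for a random-access filter that has NOT folded in
// deletions: a doc is accepted only if the filter accepts it
// AND NOT deleted.get(docID).
final class FilterAndNotDeleted implements Bits {
    private final Bits filterBits;
    private final Bits deletedDocs; // assumed null when there are no deletions

    FilterAndNotDeleted(Bits filterBits, Bits deletedDocs) {
        this.filterBits = filterBits;
        this.deletedDocs = deletedDocs;
    }

    @Override public boolean get(int docID) {
        return filterBits.get(docID)
            && (deletedDocs == null || !deletedDocs.get(docID));
    }

    @Override public int length() { return filterBits.length(); }
}

// Hypothetical adapter: turns "skipDocs" semantics into "acceptDocs"
// semantics (or back) by negating every lookup.
final class NotBits implements Bits {
    private final Bits in;

    NotBits(Bits in) { this.in = in; }

    @Override public boolean get(int index) { return !in.get(index); }
    @Override public int length() { return in.length(); }
}
```

If the flex APIs took acceptDocs directly, a filter's bits could be passed down as-is and NotBits would only be needed for the deleted-docs side.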

{quote}
I think that any type of solution should support the great feature of Lucene queries, for
example, FilteredQuery should use that, allowing to build complex query expressions without
having the mentioned optimization only applied on the top level search.
{quote}
Good point -- FilteredQuery should use this same low level API if its filter is random access
and "dense enough".
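
As a rough illustration of the "random access and dense enough" check, here is a sketch under stated assumptions. FilterStrategy, applyDownLow and the 1% default are hypothetical; the threshold is taken from the "1% or 2% or something" guess above and is not a measured value.

```java
// Hypothetical heuristic, not part of Lucene or the patch.
final class FilterStrategy {
    // Assumed default: filters accepting fewer than 1% of docs stay "up high".
    static final double DEFAULT_MIN_DENSITY = 0.01;

    private FilterStrategy() {}

    // Returns true if the filter should be applied "down low" (passed into
    // the scorer as random-access bits), false if it should be applied
    // "up high" as an iterator, the way it is done today.
    static boolean applyDownLow(boolean randomAccess, long acceptedCount,
                                int maxDoc, double minDensity) {
        if (!randomAccess || maxDoc <= 0) {
            return false;
        }
        double density = (double) acceptedCount / maxDoc;
        return density >= minDensity;
    }
}
```

A per-filter override could replace the fixed default, which matches the idea of the filter providing its own method for this decision.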

{quote}
As most filter results do support random access, either because they use OpenBitSet, or because
they are built on top of FieldCache functionality, I think this feature will give great speed
improvements to the query execution time.
{quote}

Right, the speed gains are often awesome!

> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch,
>                      LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
>   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
>     10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
>   * I test across multiple queries.  1-X means an OR query, eg 1-4
>     means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
>     AND 3 AND 4.  "u s" means "united states" (phrase search).
>   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>     95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>     100 (filter=null, control)).
>   * Method high means I use random-access filter API in
>     IndexSearcher's main loop.  Method low means I use random-access
>     filter API down in SegmentTermDocs (just like deleted docs
>     today).
>   * Baseline (QPS) is current trunk, where filter is applied as iterator up
>     "high" (ie in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

