lucene-dev mailing list archives

From "Uwe Schindler (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
Date Sat, 08 Oct 2011 09:32:29 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1536:
----------------------------------

    Attachment: changes-yonik-uwe.patch
                LUCENE-1536.patch
                LUCENE-1536-rewrite.patch

Attached you will find a new patch, LUCENE-1536.patch, incorporating Yonik's changes plus some
minor improvements:
- Changed the Javadocs of DIS.bits() to explain what you should and should not do.
- Added another early exit condition in FilteredQuery#Weight.scorer(): as we already get the
first matching doc of the filter iterator before looking at bits or creating the query scorer,
we should exit early if the first matching doc is DISI.NO_MORE_DOCS. This saves us from
creating the query scorer.
- I removed Robert's safety TODO in SolrIndexSearcher. It no longer disables random access
completely. After Yonik's changes, all places in Solr that are not random-access safe are
disabled, e.g. SolrIndexSearcher.FilterImpl (not sure what this class does, maybe it should
also implement bits()?). We should handle that in a Solr-specific optimization issue.
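
To make the early exit in the second item concrete, here is a minimal sketch. It is not the code from the patch: DocIterator, over() and the string standing in for the query scorer are hypothetical simplifications, and only the NO_MORE_DOCS sentinel (Integer.MAX_VALUE in Lucene's DocIdSetIterator) is taken from Lucene.

```java
// Simplified model of the early exit; none of these types are Lucene classes.
class EarlyExitSketch {
    // Same sentinel value as DocIdSetIterator.NO_MORE_DOCS in Lucene.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Counts how often the (expensive) query scorer would have been created.
    static int scorersCreated = 0;

    // Hypothetical stand-in for the filter's iterator.
    interface DocIterator {
        int nextDoc();
    }

    // Builds an iterator over a sorted list of doc ids (illustration only).
    static DocIterator over(int... docs) {
        return new DocIterator() {
            private int i = 0;

            @Override
            public int nextDoc() {
                return i < docs.length ? docs[i++] : NO_MORE_DOCS;
            }
        };
    }

    // Mirrors the order described above: fetch the first filter doc first,
    // and only create the query scorer if the filter matches anything.
    static String scorer(DocIterator filterIterator) {
        int firstFilterDoc = filterIterator.nextDoc();
        if (firstFilterDoc == NO_MORE_DOCS) {
            return null; // early exit: the filter is empty, no scorer needed
        }
        scorersCreated++;
        return "scorer starting at doc " + firstFilterDoc;
    }
}
```

With an empty filter, scorer() returns null and scorersCreated stays unchanged, so the query scorer is never built.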

Another cool thing with filters is ANDing filters without ChainedFilter (this approach
is very effective with random access, as it does not allocate an additional BitSet). If you
want to AND together several filters and apply them to a Query, do the following:

{code:java}
// IS is the IndexSearcher: it applies filter1, the FilteredQuery applies filter2
IS.search(new FilteredQuery(query, filter2), filter1, ...);
{code}

You can chain in even more filters by adding more FilteredQueries. What this does:
IS automatically creates another FilteredQuery to apply the filter and gets the Weight
of that top-level FilteredQuery. Its scorer is the top-level one; it gets the filter and,
if the filter is random access, executes it with acceptDocs==liveDocs. The resulting bits
of this filter are passed to Weight.scorer of the second FilteredQuery as acceptDocs.
That one passes the acceptDocs (which are already filtered) to its Filter and, if that is again
random access, passes the result as acceptDocs to the inner Query's scorer. Finally the top-level
IS executes scorer.score(Collector) on what is in fact the inner Query's scorer (no wrappers!)
with all filtering applied in acceptDocs. This is incredibly cool :-)
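
The acceptDocs hand-off described above can be modelled in a few lines. This is only a sketch under simplifying assumptions, not the patch's code: Bits here is a hypothetical one-method interface, filters and the query are plain predicates, and the lazy composition in applyFilter() is my shortcut for "the result bits of the filter".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Simplified model of chaining random-access filters through acceptDocs.
// None of these types are Lucene classes.
class AcceptDocsChainSketch {

    // Hypothetical stand-in for a random-access doc set (liveDocs, acceptDocs).
    interface Bits {
        boolean get(int doc);
    }

    // A random-access filter receives the incoming acceptDocs and returns bits
    // that are set only for docs accepted by both acceptDocs and the filter.
    static Bits applyFilter(IntPredicate filter, Bits acceptDocs) {
        return doc -> acceptDocs.get(doc) && filter.test(doc);
    }

    // Stand-in for the inner query's scorer: it checks acceptDocs itself,
    // so no wrapping scorer is needed to apply the filters.
    static List<Integer> score(IntPredicate query, Bits acceptDocs, int maxDoc) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            if (query.test(doc) && acceptDocs.get(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    static List<Integer> search(int maxDoc) {
        Bits liveDocs = doc -> doc != 6;             // doc 6 is deleted
        IntPredicate filter1 = doc -> doc % 2 == 0;  // filter given to the searcher
        IntPredicate filter2 = doc -> doc % 3 == 0;  // filter inside the FilteredQuery
        IntPredicate query = doc -> doc > 0;         // the inner query

        Bits afterFilter1 = applyFilter(filter1, liveDocs);     // acceptDocs == liveDocs
        Bits afterFilter2 = applyFilter(filter2, afterFilter1); // already filtered acceptDocs
        return score(query, afterFilter2, maxDoc);              // inner scorer, no wrappers
    }
}
```

Calling AcceptDocsChainSketch.search(20) yields docs 12 and 18: doc 6 passes both filters but is dropped because it is not in liveDocs.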

One thing about large patches in an issue:
If you are working on an issue, have local changes in your checkout, and have posted a patch,
and somebody else then posts an updated patch to the issue, it is often nice to see
the diff between those patches. I wanted to see what Yonik changed, but a 140 K patch is not
easy to handle. The trick is "interdiff" from the patchutils package: you can call "interdiff
LUCENE-1536-original.patch LUCENE-1536-yonik.patch" and you get a patch containing only the
changes made by Yonik. This patch can even be applied to your local, already patched checkout.

The changes-yonik-uwe.patch was generated that way and shows what changes I made in my last
patch compared to Yonik's original.
                
> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch,
LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch,
LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, changes-yonik-uwe.patch, luceneutil.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
>   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
>     10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
>   * I test across multiple queries.  1-X means an OR query, eg 1-4
>     means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
>     AND 3 AND 4.  "u s" means "united states" (phrase search).
>   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>     95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>     100 (filter=null, control)).
>   * Method high means I use random-access filter API in
>     IndexSearcher's main loop.  Method low means I use random-access
>     filter API down in SegmentTermDocs (just like deleted docs
>     today).
>   * Baseline (QPS) is current trunk, where filter is applied as iterator up
>     "high" (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

