lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
Date Thu, 13 Oct 2011 10:17:12 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Uwe Schindler updated LUCENE-1536:
----------------------------------

    Attachment: LUCENE-1536.patch

Patch that for now (until BooleanWeight is fixed) rewuires in-order scoring when a filter
is applied, as suggested by Robert.

There was also a silly bug in CachingWrapperFilter that made it not to apply acceptDocs on
cached results. This may also be a reason for result differences.

Mike: "Normal Filters" always respect acceptDocs, as they generally use the acceptDocs to
pass them to IndexReader. E.g. MTQWF or all other filters. Only some special filters e.g.
working on FieldCache have to respect acceptDocs. So there is no slowdown at all, it all exactly
as before (pre-acceptDocs Filters always used getLiveDocs() - although they were not required
to do so). The silly thing was that we applied accept docs simply at too many places, so we
decided to make the filters do it themselves (as they can, because in previous patch they
always called IR.getLiveDocs(); except FiledCacheFilters).

The optimizations for CachingWrapperFilter to also cache acceptDocs should maybe done in a
separate issue. We have currently no slowdown, as its still faster than before. Improvements
can come later.

I agree, we can make the pair (IR, acceptDocs) a key into the cache map, the only problem
is that accpetDocs are not required to implement equals/hashCode - and if they do, like in
FixedBitSet, it's slow. So the cache should do this using systemHashCode/IdentityHashMap-like
algorithm (so only compare the Bits pointer, not contents).
                
> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch,
LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch,
LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
>   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
>     10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
>   * I test across multiple queries.  1-X means an OR query, eg 1-4
>     means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
>     AND 3 AND 4.  "u s" means "united states" (phrase search).
>   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>     95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>     100 (filter=null, control)).
>   * Method high means I use random-access filter API in
>     IndexSearcher's main loop.  Method low means I use random-access
>     filter API down in SegmentTermDocs (just like deleted docs
>     today).
>   * Baseline (QPS) is current trunk, where filter is applied as iterator up
>     "high" (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message