Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Tue, 11 Oct 2011 12:55:12 +0000 (UTC)
From: "Uwe Schindler (Commented) (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <14002452.550.1318337712452.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-1536) if a filter can support random
 access API, we should use it
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124993#comment-13124993 ] 

Uwe Schindler commented on LUCENE-1536:
---------------------------------------

Sorry Robert, if that works it's fine, but the test is still failing, so something is wrong.

My problem with the patch is rather that, for most filters, the call to getDocIdSet() is the most expensive one. So you are right about caching the result. But we actually calculate the DocIdSet twice (unless we use CachingWrapperFilter) if acceptDocs != liveDocs. And as the first segment is generally the largest one, this is even worse.
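
The double computation described above could be avoided by memoizing the result per segment, roughly as follows. This is only a sketch, not Lucene's CachingWrapperFilter: the class PerSegmentDocSetCache, the plain Object segment key, and java.util.BitSet standing in for DocIdSet are all assumptions made for illustration. The key is assumed to capture everything the result depends on (including acceptDocs).

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical per-segment memoizer; BitSet stands in for DocIdSet. */
final class PerSegmentDocSetCache {
    private final Map<Object, BitSet> cache = new ConcurrentHashMap<>();
    private final Function<Object, BitSet> expensiveFilter;
    private final AtomicInteger computations = new AtomicInteger();

    PerSegmentDocSetCache(Function<Object, BitSet> expensiveFilter) {
        this.expensiveFilter = expensiveFilter;
    }

    /** Returns the cached set for this segment, computing it at most once. */
    BitSet get(Object segmentKey) {
        return cache.computeIfAbsent(segmentKey, key -> {
            computations.incrementAndGet();
            return expensiveFilter.apply(key);
        });
    }

    /** Number of times the expensive filter was actually evaluated. */
    int computations() {
        return computations.get();
    }
}
```

With such a cache, inspecting the set for the first segment and later building the scorer for it would both go through get() and trigger only one evaluation.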

In my opinion, the whole approach of looking at the sparseness of the DocIdSet is broken for this case: we can only do this correctly per segment, but we later require all segments to use the same scorer implementation. I have no idea how to solve this. It would not even be enough, as in Chris's/Mike's original approaches, to support something like DocIdSet.useBits()/isSparse(), as this is also per segment.

There is also a second problem: a filter might return a DocIdSet that does not support bits() for one segment, but one that does for other segments. How do we handle that? There is one case where this happens (DocIdSet.EMPTY_DOCIDSET always returns null for bits), but this one is gracefully handled by an early exit condition, so we won't get an NPE.
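
To make the mixed case concrete, here is a rough sketch of a per-segment null check on bits() with a fallback to ordered iteration. The DocSet interface, its sortedDocs() method, and the of() factory are hypothetical stand-ins, not the DocIdSet API from the patch. It does not address the constraint that all segments must share one scorer implementation; it only shows the null handling.

```java
import java.util.BitSet;

final class FilterAccessSketch {
    /** Hypothetical stand-in for a per-segment doc ID set. */
    interface DocSet {
        BitSet bits();       // null when random access is not supported
        int[] sortedDocs();  // ascending doc IDs, always available
    }

    /** Builds a DocSet from ascending doc IDs, with or without random access. */
    static DocSet of(final int[] docs, final boolean randomAccess) {
        final BitSet bits = new BitSet();
        for (int doc : docs) {
            bits.set(doc);
        }
        return new DocSet() {
            public BitSet bits() { return randomAccess ? bits : null; }
            public int[] sortedDocs() { return docs; }
        };
    }

    /** Counts query docs accepted by the filter; queryDocs must be ascending. */
    static int countAccepted(int[] queryDocs, DocSet set) {
        int count = 0;
        BitSet bits = set.bits();
        if (bits != null) {
            // Random-access path: one bit lookup per candidate document.
            for (int doc : queryDocs) {
                if (bits.get(doc)) count++;
            }
            return count;
        }
        // Fallback path: leapfrog over the two ascending doc ID lists.
        int[] filterDocs = set.sortedDocs();
        int i = 0, j = 0;
        while (i < queryDocs.length && j < filterDocs.length) {
            if (queryDocs[i] == filterDocs[j]) { count++; i++; j++; }
            else if (queryDocs[i] < filterDocs[j]) i++;
            else j++;
        }
        return count;
    }
}
```

Both paths give the same count for the same input; only the access pattern differs, which is why the choice ends up tied to in-order versus out-of-order scoring.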

The only possible solution is to make Filters always request in-order scoring, but this would limit our optimization possibilities.

Finally, I still think we should fix BS1 and BS2 to return identical scores (and write a test for that which compares scores). Second, in Mike's document/score listing above with/without the patch, I see no score differences; only the order of docs is different (which is caused by out-of-order missing), so where is the problem?
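
A score-comparing test would need to ignore hit order, since out-of-order collection changes the order but should not change scores. A minimal sketch of such a comparison follows; the class ScoreComparison, its method name, and the parallel-array representation of hits are assumptions for illustration and are not taken from the existing Lucene tests.

```java
import java.util.HashMap;
import java.util.Map;

final class ScoreComparison {
    /**
     * Returns true if both hit lists contain the same doc IDs with scores
     * equal within epsilon, regardless of the order of the hits.
     * docsX[i] and scoresX[i] are assumed to describe the same hit.
     */
    static boolean sameScores(int[] docsA, float[] scoresA,
                              int[] docsB, float[] scoresB, float epsilon) {
        if (docsA.length != docsB.length) {
            return false;
        }
        Map<Integer, Float> byDoc = new HashMap<>();
        for (int i = 0; i < docsA.length; i++) {
            byDoc.put(docsA[i], scoresA[i]);
        }
        for (int i = 0; i < docsB.length; i++) {
            // remove() so that a doc appearing twice in B is detected
            Float expected = byDoc.remove(docsB[i]);
            if (expected == null || Math.abs(expected - scoresB[i]) > epsilon) {
                return false;
            }
        }
        return byDoc.isEmpty();
    }
}
```

Running the same query through both scorers and feeding the two hit lists to this check would separate a real score difference from a mere ordering difference.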
                
> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
>   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
>     10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
>   * I test across multiple queries.  1-X means an OR query, eg 1-4
>     means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
>     AND 3 AND 4.  "u s" means "united states" (phrase search).
>   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>     95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>     100 (filter=null, control)).
>   * Method high means I use random-access filter API in
>     IndexSearcher's main loop.  Method low means I use random-access
>     filter API down in SegmentTermDocs (just like deleted docs
>     today).
>   * Baseline (QPS) is current trunk, where filter is applied as iterator up
>     "high" (ie in IndexSearcher's search loop).
