Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 14588 invoked from network); 9 Feb 2009 20:52:28 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 9 Feb 2009 20:52:27 -0000 Received: (qmail 64459 invoked by uid 500); 9 Feb 2009 20:52:25 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 64277 invoked by uid 500); 9 Feb 2009 20:52:24 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 63959 invoked by uid 99); 9 Feb 2009 20:52:23 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 09 Feb 2009 12:52:23 -0800 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 09 Feb 2009 20:52:22 +0000 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 4A199234C4B7 for ; Mon, 9 Feb 2009 12:52:02 -0800 (PST) Message-ID: <2090999791.1234212722302.JavaMail.jira@brutus> Date: Mon, 9 Feb 2009 12:52:02 -0800 (PST) From: "Michael McCandless (JIRA)" To: java-dev@lucene.apache.org Subject: [jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it In-Reply-To: <1491949348.1233779399612.JavaMail.jira@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672005#action_12672005 ] Michael McCandless commented on LUCENE-1536: -------------------------------------------- OK I tested a different approach for matching when the filter is relatively sparse filter (<= 10% of index size), by implementing a simplistic prototype "random access" scorer API (JumpScorer), which only exposes the method "boolean jump(int docID)" that returns true if that doc matches, else false (and no next() under the hood). I only implemented JumpScorer for pure AND/OR queries (ie no excluded terms), so I only test for these queries. I convert the filter to SortedVIntList up front, so iteration is fast. Then during matching I iterate through each doc in the filter, and ask the random-access scoring API to test whether it accepts the doc, and collect it if so. This only performs better when the filter is sparse relative to the query. Once we merge Query/Filter, I think this optimization can more generally be used whenever one sub-query is very restrictive compared to the rest of the sub-queries. This is basically the reverse of what I first tested, which was to take a filter that can support random access API and "distribute" it down to each TermQuery, which gives very good gains especially when filter is in the middle of the sparse/dense range. Whereas this test keeps the iterator API on the filter, but switches to a random access API on the scorer. Results: ||%tg Filter||Query||Hits||QPS||QPSNew||%tg change|| |1%|1-2| 5363| 47.4| 66.7| 40.7%| |2%|1-2| 10675| 37.6| 50.6| 34.6%| |5%|1-2| 26880| 28.6| 37.0| 29.4%| |10%|1-2| 53673| 23.8| 26.2| 10.1%| |1%|1-4| 6544| 26.9| 37.2| 38.3%| |2%|1-4| 13062| 21.2| 29.2| 37.7%| |5%|1-4| 32815| 16.1| 21.2| 31.7%| |10%|1-4| 65491| 13.2| 15.2| 15.2%| |1%|1-10| 8406| 13.0| 17.6| 35.4%| |2%|1-10| 16756| 10.2| 14.2| 39.2%| |5%|1-10| 41937| 7.7| 10.3| 33.8%| |10%|1-10| 83828| 6.3| 7.8| 23.8%| |1%|+1-2| 2308| 63.6| 82.9| 30.3%| |2%|+1-2| 4621| 49.9| 60.7| 21.6%| |5%|+1-2| 11706| 35.8| 47.6| 33.0%| |10%|+1-2| 23272| 28.3| 35.5| 25.4%| |1%|+1-4| 923| 34.4| 58.0| 68.6%| |2%|+1-4| 1849| 28.5| 44.9| 57.5%| |5%|+1-4| 4794| 22.0| 33.7| 53.2%| |10%|+1-4| 9595| 19.8| 25.4| 28.3%| |1%|+1-10| 292| 17.0| 36.6|115.3%| |2%|+1-10| 579| 15.2| 30.2| 98.7%| |5%|+1-10| 1517| 13.5| 22.2| 64.4%| |10%|+1-10| 2999| 12.4| 17.4| 40.3%| > if a filter can support random access API, we should use it > ----------------------------------------------------------- > > Key: LUCENE-1536 > URL: https://issues.apache.org/jira/browse/LUCENE-1536 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Affects Versions: 2.4 > Reporter: Michael McCandless > Assignee: Michael McCandless > Priority: Minor > Attachments: LUCENE-1536.patch > > > I ran some performance tests, comparing applying a filter via > random-access API instead of current trunk's iterator API. > This was inspired by LUCENE-1476, where we realized deletions should > really be implemented just like a filter, but then in testing found > that switching deletions to iterator was a very sizable performance > hit. > Some notes on the test: > * Index is first 2M docs of Wikipedia. Test machine is Mac OS X > 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. > * I test across multiple queries. 1-X means an OR query, eg 1-4 > means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 > AND 3 AND 4. "u s" means "united states" (phrase search). > * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, > 95, 98, 99, 99.99999 (filter is non-null but all bits are set), > 100 (filter=null, control)). > * Method high means I use random-access filter API in > IndexSearcher's main loop. Method low means I use random-access > filter API down in SegmentTermDocs (just like deleted docs > today). > * Baseline (QPS) is current trunk, where filter is applied as iterator up > "high" (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org