Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 85585 invoked from network); 25 May 2007 21:44:40 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 25 May 2007 21:44:40 -0000 Received: (qmail 50669 invoked by uid 500); 25 May 2007 21:44:42 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 50634 invoked by uid 500); 25 May 2007 21:44:42 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 50623 invoked by uid 99); 25 May 2007 21:44:42 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 25 May 2007 14:44:42 -0700 X-ASF-Spam-Status: No, hits=-100.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.4] (HELO brutus.apache.org) (140.211.11.4) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 25 May 2007 14:44:36 -0700 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 45A2F714063 for ; Fri, 25 May 2007 14:44:16 -0700 (PDT) Message-ID: <3704965.1180129456282.JavaMail.jira@brutus> Date: Fri, 25 May 2007 14:44:16 -0700 (PDT) From: "Michael Busch (JIRA)" To: java-dev@lucene.apache.org Subject: [jira] Commented: (LUCENE-866) Multi-level skipping on posting lists In-Reply-To: <10098370.1177119675313.JavaMail.jira@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/LUCENE-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499236 ] Michael Busch commented on LUCENE-866: -------------------------------------- I ran some more performance experiments to test the average speedup. I used the same index (about 1.2GB, optimized, docs from wikipedia) and created 50,000 random queries. Each query has 3 AND terms and each term has a df > 100. Here are the results: old (one level skipping): Time: 62141 ms. VInt reads: 752430441 new (multi level): Time: 51969 ms. VInt reads: 435504734 This is a speedup of about 16% and i/o savings of 42%. Then I changed the algorithm a bit to not always start on the highest skip level to find the next skip but to start at level 0 and walk up until the next skip point is greater than the target. This speeds up things further so that the overall gain is about 20%: Time: 49500 ms. Bytes read: 435504734 This 20% speedup is for average AND queries. Certain queries benefit even more from multi-level skipping as the results show that I attached earlier. I will attach a new patch as soon as LUCENE-888 is committed. > Multi-level skipping on posting lists > ------------------------------------- > > Key: LUCENE-866 > URL: https://issues.apache.org/jira/browse/LUCENE-866 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Reporter: Michael Busch > Assigned To: Michael Busch > Priority: Minor > Fix For: 2.2 > > Attachments: lucene-866.patch > > > To accelerate posting list skips (TermDocs.skipTo(int)) Lucene uses skip lists. > The default skip interval is set to 16. If we want to skip e. g. 100 documents, > then it is not necessary to read 100 entries from the posting list, but only > 100/16 = 6 skip list entries plus 100%16 = 4 entries from the posting list. This > speeds up conjunction (AND) and phrase queries significantly. > However, the skip interval is always a compromise. If you have a very big index > with huge posting lists and you want to skip over lets say 100k documents, then > it is still necessary to read 100k/16 = 6250 entries from the skip list. For big > indexes the skip interval could be set to a higher value, but then after a big > skip a long scan to the target doc might be necessary. > A solution for this compromise is to have multi-level skip lists that guarantee a > logarithmic amount of skips to any target in the posting list. This patch > implements such an approach in the following way: > Example for skipInterval = 3: > c (skip level 2) > c c c (skip level 1) > x x x x x x x x x x (skip level 0) > d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d (posting list) > 3 6 9 12 15 18 21 24 27 30 (df) > > d - document > x - skip data > c - skip data with child pointer > > Skip level i contains every skipInterval-th entry from skip level i-1. Therefore the > number of entries on level i is: floor(df / ((skipInterval ^ (i + 1))). > > Each skip entry on a level i>0 contains a pointer to the corresponding skip entry in > list i-1. This guarantees a logarithmic amount of skips to find the target document. > Implementations details: > * I factored the skipping code out of SegmentMerger and SegmentTermDocs to > simplify those classes. The two new classes AbstractSkipListReader and > AbstractSkipListWriter implement the skipping functionality. > * While AbstractSkipListReader and Writer take care of writing and reading the > multiple skip levels, they do not implement an actual skip data format. The two > new subclasses DefaultSkipListReader and Writer implement the skip data format > that is currently used in Lucene (with two file pointers for the freq and prox > file and with payload length information). I added this extra layer to be > prepared for flexible indexing and different posting list formats. > > > File format changes: > * I added the new parameter 'maxSkipLevels' to the term dictionary and increased the > version of this file. If maxSkipLevels is set to one, then the format of the freq > file does not change at all, because we only have one skip level as before. For > backwards compatibility maxSkipLevels is set to one automatically if an index > without the new parameter is read. > * In case maxSkipLevels > 1, then the frq file changes as follows: > FreqFile (.frq) --> ^TermCount > SkipData --> <^(Min(maxSkipLevels, > floor(log(DocFreq/log(skipInterval))) - 1)>, SkipLevel> > SkipLevel --> ^DocFreq/(SkipInterval^(Level + 1)) > Remark: The length of the SkipLevel is not stored for level 0, because 1) it is not > needed, and 2) the format of this file does not change for maxSkipLevels=1 then. > > > All unit tests pass with this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org