Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <3704965.1180129456282.JavaMail.jira@brutus>
Date: Fri, 25 May 2007 14:44:16 -0700 (PDT)
From: "Michael Busch (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-866) Multi-level skipping on posting
 lists
In-Reply-To: <10098370.1177119675313.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499236 ] 

Michael Busch commented on LUCENE-866:
--------------------------------------

I ran some more performance experiments to test the average speedup.
I used the same index (about 1.2GB, optimized, docs from wikipedia)
and created 50,000 random queries. Each query has 3 AND terms and 
each term has a df > 100. 

Here are the results:

old (one level skipping):
Time: 62141 ms.
VInt reads: 752430441

new (multi level):
Time: 51969 ms.
VInt reads: 435504734

This is a speedup of about 16% and i/o savings of 42%.

Then I changed the algorithm a bit to not always start on the 
highest skip level to find the next skip but to start at level 0
and walk up until the next skip point is greater than the target.
This speeds up things further so that the overall gain is about
20%:

Time: 49500 ms.
Bytes read: 435504734

This 20% speedup is for average AND queries. Certain queries
benefit even more from multi-level skipping as the results
show that I attached earlier.


I will attach a new patch as soon as LUCENE-888 is committed.

> Multi-level skipping on posting lists
> -------------------------------------
>
>                 Key: LUCENE-866
>                 URL: https://issues.apache.org/jira/browse/LUCENE-866
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: lucene-866.patch
>
>
> To accelerate posting list skips (TermDocs.skipTo(int)) Lucene uses skip lists. 
> The default skip interval is set to 16. If we want to skip e. g. 100 documents, 
> then it is not necessary to read 100 entries from the posting list, but only 
> 100/16 = 6 skip list entries plus 100%16 = 4 entries from the posting list. This 
> speeds up conjunction (AND) and phrase queries significantly.
> However, the skip interval is always a compromise. If you have a very big index 
> with huge posting lists and you want to skip over lets say 100k documents, then 
> it is still necessary to read 100k/16 = 6250 entries from the skip list. For big 
> indexes the skip interval could be set to a higher value, but then after a big 
> skip a long scan to the target doc might be necessary.
> A solution for this compromise is to have multi-level skip lists that guarantee a 
> logarithmic amount of skips to any target in the posting list. This patch 
> implements such an approach in the following way:
>   Example for skipInterval = 3:
>                                                       c            (skip level 2)
>                   c                 c                 c            (skip level 1) 
>       x     x     x     x     x     x     x     x     x     x      (skip level 0)
>   d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d  (posting list)
>       3     6     9     12    15    18    21    24    27    30     (df)
>  
>   d - document
>   x - skip data
>   c - skip data with child pointer
>  
> Skip level i contains every skipInterval-th entry from skip level i-1. Therefore the 
> number of entries on level i is: floor(df / ((skipInterval ^ (i + 1))).
>  
> Each skip entry on a level i>0 contains a pointer to the corresponding skip entry in 
> list i-1. This guarantees a logarithmic amount of skips to find the target document.
> Implementations details:
>    * I factored the skipping code out of SegmentMerger and SegmentTermDocs to 
>      simplify those classes. The two new classes AbstractSkipListReader and 
> 	 AbstractSkipListWriter implement the skipping functionality.
>    * While AbstractSkipListReader and Writer take care of writing and reading the 
>      multiple skip levels, they do not implement an actual skip data format. The two 
> 	 new subclasses DefaultSkipListReader and Writer implement the skip data format 
> 	 that is currently used in Lucene (with two file pointers for the freq and prox 
> 	 file and with payload length information). I added this extra layer to be 
> 	 prepared for flexible indexing and different posting list formats. 
>       
>    
> File format changes: 
>    * I added the new parameter 'maxSkipLevels' to the term dictionary and increased the
>      version of this file. If maxSkipLevels is set to one, then the format of the freq 
> 	 file does not change at all, because we only have one skip level as before. For 
> 	 backwards compatibility maxSkipLevels is set to one automatically if an index 
> 	 without the new parameter is read. 
>    * In case maxSkipLevels > 1, then the frq file changes as follows:
>      FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount
> 	 SkipData        --> <<SkipLevelLength, SkipLevel>^(Min(maxSkipLevels, 
> 	                       floor(log(DocFreq/log(skipInterval))) - 1)>, SkipLevel>
> 	 SkipLevel       --> <SkipDatum>^DocFreq/(SkipInterval^(Level + 1))
> 	 Remark: The length of the SkipLevel is not stored for level 0, because 1) it is not 
> 	 needed, and 2) the format of this file does not change for maxSkipLevels=1 then.
> 	 
> 	 
> All unit tests pass with this patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org