lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Busch (JIRA)" <>
Subject [jira] Commented: (LUCENE-866) Multi-level skipping on posting lists
Date Sat, 21 Apr 2007 02:23:15 GMT


Michael Busch commented on LUCENE-866:

I ran a few performance tests with this patch. I built a rather big 
index (about 1.2GB) using documents from Wikipedia and optimized the
index to get big posting lists.

I compare query evaluation time with one-level skipping (old) and
multi-level skipping (new) and measure the amount of I/O by counting 
the number of VInts read from disk. Each query runs several thousand

The following queries contain very frequent and very unique terms. 
For these queries the speedup with multi-level skipping is 

   Query: +lucene +search +engine +library
   5 total matching documents
   VInt reads: old: 143268000, new: 3933000, -97.25479520897898%
   Time: old: 7234ms, new: 1157ms, -84.00608238871993%

   Query: +apache +http +server
   181 total matching documents
   VInt reads: old: 155892000, new: 27849000, -82.13570933723346%
   Time: old: 10656ms, new: 5703ms, -46.48085585585586%

Even though I/O is reduced for the next query, it runs a bit slower.
I believe the reason is that the same query runs several thousand
times, so the posting lists will be loaded into the file system
cache and the effect of less I/O is reduced, while the skipping
algorithm itself is a bit more complex:

   Query: +multi +level +skipping
   13 total matching documents
   VInt reads: old: 42894000, new: 39096000, -8.854385228703315%
   Time: old: 3875ms, new: 3922ms, 1.2129032258064516%

For the next query there is slightly more I/O necessary in the 
multi-skipping case. This is because not many big skips can be made,
but more skipping data has to be read. The top 2 skip levels are 
buffered in this first version of the patch and if no big skips
can be made than this buffering is overhead compared to the 
single-level case:

   Query: +beatles +yellow +submarine
   78 total matching documents
   VInt reads: old: 38460000, new: 38685000, 0.5850234009360374%
   Time: old: 3172ms, new: 3265ms, 2.9319041614123584%

However, if I change the query a little bit, then the speed-up
is significant (due to the very frequent stop word "the"):

   Query: +"the beatles" +yellow +submarine
   77 total matching documents
   VInt reads: old: 382307000, new: 262331000, -31.38210914265237%
   Time: old: 26703ms, new: 22828ms, -14.51147811107366%
It would be interesting to run more sophisticated benchmarks.
To run the same query several times is not very realistic 
because it reduces the effects of the I/O savings due to caching.

I'm not that familiar with the new benchmark stuff that has
been added recently, but I'll try to dig into that next week.

> Multi-level skipping on posting lists
> -------------------------------------
>                 Key: LUCENE-866
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>            Priority: Minor
>         Attachments: lucene-866.patch
> To accelerate posting list skips (TermDocs.skipTo(int)) Lucene uses skip lists. 
> The default skip interval is set to 16. If we want to skip e. g. 100 documents, 
> then it is not necessary to read 100 entries from the posting list, but only 
> 100/16 = 6 skip list entries plus 100%16 = 4 entries from the posting list. This 
> speeds up conjunction (AND) and phrase queries significantly.
> However, the skip interval is always a compromise. If you have a very big index 
> with huge posting lists and you want to skip over lets say 100k documents, then 
> it is still necessary to read 100k/16 = 6250 entries from the skip list. For big 
> indexes the skip interval could be set to a higher value, but then after a big 
> skip a long scan to the target doc might be necessary.
> A solution for this compromise is to have multi-level skip lists that guarantee a 
> logarithmic amount of skips to any target in the posting list. This patch 
> implements such an approach in the following way:
>   Example for skipInterval = 3:
>                                                       c            (skip level 2)
>                   c                 c                 c            (skip level 1) 
>       x     x     x     x     x     x     x     x     x     x      (skip level 0)
>   d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d  (posting list)
>       3     6     9     12    15    18    21    24    27    30     (df)
>   d - document
>   x - skip data
>   c - skip data with child pointer
> Skip level i contains every skipInterval-th entry from skip level i-1. Therefore the

> number of entries on level i is: floor(df / ((skipInterval ^ (i + 1))).
> Each skip entry on a level i>0 contains a pointer to the corresponding skip entry
> list i-1. This guarantees a logarithmic amount of skips to find the target document.
> Implementations details:
>    * I factored the skipping code out of SegmentMerger and SegmentTermDocs to 
>      simplify those classes. The two new classes AbstractSkipListReader and 
> 	 AbstractSkipListWriter implement the skipping functionality.
>    * While AbstractSkipListReader and Writer take care of writing and reading the 
>      multiple skip levels, they do not implement an actual skip data format. The two

> 	 new subclasses DefaultSkipListReader and Writer implement the skip data format 
> 	 that is currently used in Lucene (with two file pointers for the freq and prox 
> 	 file and with payload length information). I added this extra layer to be 
> 	 prepared for flexible indexing and different posting list formats. 
> File format changes: 
>    * I added the new parameter 'maxSkipLevels' to the term dictionary and increased the
>      version of this file. If maxSkipLevels is set to one, then the format of the freq

> 	 file does not change at all, because we only have one skip level as before. For 
> 	 backwards compatibility maxSkipLevels is set to one automatically if an index 
> 	 without the new parameter is read. 
>    * In case maxSkipLevels > 1, then the frq file changes as follows:
>      FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount
> 	 SkipData        --> <<SkipLevelLength, SkipLevel>^(Min(maxSkipLevels, 
> 	                       floor(log(DocFreq/log(skipInterval))) - 1)>, SkipLevel>
> 	 SkipLevel       --> <SkipDatum>^DocFreq/(SkipInterval^(Level + 1))
> 	 Remark: The length of the SkipLevel is not stored for level 0, because 1) it is not

> 	 needed, and 2) the format of this file does not change for maxSkipLevels=1 then.
> All unit tests pass with this patch.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message