lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-2905) Sep codec writes insane amounts of skip data
Date Mon, 14 Feb 2011 00:13:57 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Robert Muir updated LUCENE-2905:
--------------------------------

    Attachment: LUCENE-2905_interleaved.patch

Attached is an initial stab at an interleaved block layout for fixed int block coders.

It's limited, and optimized for the case where doc, freq, and pos have the same block size...

The existing fixed int block API is not really changed, only extended, so you can implement
this API and easily choose between a Sep or Fixed index layout.

The doc and freq blocks are interleaved (doc block, freq block) in such a way that it's generally
transparent to bulk enum consumers: they work in lock-step on the same actual underlying
index input (sharing the I/O buffer, etc.).

Pointers in the term dictionary and skip data are reduced, because they only point to the block/offset
in the single .doc file; the freq block is implied, following it in parallel.
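To make the layout concrete, here is a toy sketch of the interleaved idea in plain Java (the class, the fixed 4-byte-per-int encoding, and the tiny block size are all illustrative assumptions, not the patch's actual API; the real codecs compress each block):

```java
import java.nio.ByteBuffer;

// Toy model of an interleaved block layout: each block of doc deltas is
// written immediately followed by its freq block, so one file pointer
// locates both (the freq block's position is implied).
public class InterleavedSketch {
    static final int BLOCK_SIZE = 4; // tiny for illustration; real codecs use 64/128

    // Write blocks as: [doc block][freq block][doc block][freq block]...
    static byte[] write(int[][] docBlocks, int[][] freqBlocks) {
        ByteBuffer buf = ByteBuffer.allocate(1024);
        for (int b = 0; b < docBlocks.length; b++) {
            for (int v : docBlocks[b]) buf.putInt(v);  // doc deltas
            for (int v : freqBlocks[b]) buf.putInt(v); // freqs, in lock-step
        }
        byte[] out = new byte[buf.position()];
        buf.flip();
        buf.get(out);
        return out;
    }

    // A single block index addresses both halves of the (doc, freq) pair,
    // so skip data needs one pointer instead of two.
    static long blockFilePointer(int blockIndex) {
        return (long) blockIndex * 2 * BLOCK_SIZE * Integer.BYTES;
    }
}
```

With fixed-width ints this pointer is pure arithmetic; with compressed blocks the pointer comes from the skip data, but there is still only one of them per block pair.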

I added skipBlock(), which for these fixed encoders means they only read their header (typically
a single vint) and skip their compressed payload. Because of this, I increased the skip interval
(to the block size, actually) in the patch.

This might seem scary, but it's not so bad, because even without using any skip data, we can skip
over the blocks themselves... e.g. skipping over 16,384 pending positions is only 128 vint
reads.
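A toy sketch of what skipBlock() amounts to for these codecs (illustrative plain Java, not the patch's code): each block on disk is a vint header giving the payload length, followed by the payload, so skipping a block costs one vint read plus a seek, with no decompression. At a block size of 128, skipping 16,384 pending positions is 16,384 / 128 = 128 such header reads.

```java
import java.nio.ByteBuffer;

// Toy skipBlock(): each compressed block is laid out as a single vint
// header (numBytes) followed by numBytes of payload.
public class SkipBlockSketch {
    // Standard vint: 7 data bits per byte, high bit is the continuation flag.
    static int readVInt(ByteBuffer in) {
        int b = in.get();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Skip one block: read the header, then jump over the payload
    // without decoding any of it.
    static void skipBlock(ByteBuffer in) {
        int numBytes = readVInt(in);
        in.position(in.position() + numBytes);
    }
}
```

So even with a coarse skip interval, walking past whole blocks stays cheap: the cost is linear in the number of blocks, not in the number of encoded positions.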

For now, I only cut over BulkVInt for testing. Here's what I have so far:
the total index size for the wikipedia benchmark index decreased from 4,936,834,246 to
4,568,172,525 bytes (note this was already recently reduced from 5,180,088,695 bytes).

Here are the perf results:
||Query||QPS branch||QPS patch||Pct diff||
|unit~0.7|29.17|28.34|{color:red}-2.9%{color}|
|unit*|30.28|29.65|{color:red}-2.1%{color}|
|uni*|17.56|17.20|{color:red}-2.0%{color}|
|unit~0.5|17.89|17.54|{color:red}-1.9%{color}|
|doctitle:.*[Uu]nited.*|4.02|3.97|{color:red}-1.3%{color}|
|un*d|17.43|17.28|{color:red}-0.9%{color}|
|u*d|8.97|8.92|{color:red}-0.5%{color}|
|doctimesecnum:[10000 TO 60000]|10.88|10.92|{color:green}0.3%{color}|
|+united +states|13.98|14.04|{color:green}0.5%{color}|
|united~0.6|8.19|8.24|{color:green}0.6%{color}|
|"united states"|9.00|9.13|{color:green}1.4%{color}|
|states|37.75|38.72|{color:green}2.6%{color}|
|united~0.75|11.45|11.75|{color:green}2.6%{color}|
|+nebraska +states|88.60|92.81|{color:green}4.7%{color}|
|united states|11.71|12.80|{color:green}9.3%{color}|
|"united states"~3|4.88|5.41|{color:green}10.7%{color}|
|spanFirst(unit, 5)|166.83|198.12|{color:green}18.8%{color}|
|spanNear([unit, state], 10, true)|36.10|47.48|{color:green}31.5%{color}|

The patch is really ugly, the whole thing is basically a nocommit, and in general the tests
pass (I have at least one long-tail random failure to hunt down), but I think it has some promise.

I mostly only added the capabilities to the old docsEnums and docsAndPositionsEnums, so we
don't see all the benefits, I think... ideally we can tweak the bulkpostings APIs (especially
positions) to take advantage of some of this stuff.




> Sep codec writes insane amounts of skip data
> --------------------------------------------
>
>                 Key: LUCENE-2905
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2905
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Robert Muir
>             Fix For: Bulk Postings branch
>
>         Attachments: LUCENE-2905_intblock.patch, LUCENE-2905_interleaved.patch, LUCENE-2905_simple64.patch,
LUCENE-2905_skipIntervalMin.patch
>
>
> Currently, even if we use better compression algorithms via Fixed or Variable Intblock
> encodings, we have problems with both performance and index size versus StandardCodec.
> Consider the following numbers:
> {noformat}
> standard:
> frq: 1,862,174,204 bytes
> prx: 1,146,898,936 bytes
> tib: 541,128,354 bytes
> complete index: 4,321,032,720 bytes
> bulkvint:
> doc: 1,297,215,588 bytes
> frq: 725,060,776 bytes
> pos: 1,163,335,609 bytes
> tib: 729,019,637 bytes
> complete index: 5,180,088,695 bytes
> simple64:
> doc: 1,260,869,240 bytes
> frq: 234,491,576 bytes
> pos: 1,055,024,224 bytes
> skp: 473,293,042 bytes
> tib: 725,928,817 bytes
> complete index: 4,520,488,986 bytes
> {noformat}
> I think there are several reasons for this:
> * Splitting into separate files (e.g. postings into .doc + .freq). 
> * Having to store both a relative delta to the block start, and an offset into the block.
> * In a lot of cases various numbers involved are larger than they should be: e.g. they
are file pointer deltas, but the block size is fixed...
> Here are some ideas (some are probably stupid) of things we could do to try to fix this:
> Is Sep really necessary? Instead, should we make an alternative to Sep, Interleaved, that
interleaves doc and freq blocks (doc,freq,doc,freq) into one file? The concrete impl could
implement skipBlock() for when consumers only want doc deltas: e.g. for Simple64, blocks on disk
are fixed size, so it could just skip N bytes. Fixed int block codecs like PFOR and BulkVInt
just read the single numBytes header they already have today, and skip numBytes.
> Isn't our skipInterval too low? Most of our codecs are using block sizes such as 64 or
128, so a skipInterval of 16 seems a little overkill.
> Shouldn't skipInterval not even be a final constant in SegmentWriteState, but instead
completely private to the codec?
> For block codecs, doesn't it make sense for them to only support skipping to the start
of a block? Then, their skip pointers don't need to be a combination of delta + upto, because
upto is always zero. What would we have to modify in the bulkpostings API for jump() to work
with this?
> For block codecs, shouldn't skipInterval then be some sort of divisor, based on block
size (maybe by default it's 1, meaning we can skip to the start of every block)?
> For codecs like Simple64 that encode fixed-length frames, shouldn't we use a 'blockid' instead
of a file pointer, so that we get smaller numbers? e.g. Simple64 can do blockid * 8 to get to
the file pointer.
> Going along with the blockid concept, couldn't pointers in the terms dict be blockid deltas
from the index term, instead of fp deltas? This would be smaller numbers and we could compress
this metadata better.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira


