lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Renaud Delbru (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-2905) Sep codec writes insane amounts of skip data
Date Fri, 04 Feb 2011 17:54:30 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990649#comment-12990649
] 

Renaud Delbru edited comment on LUCENE-2905 at 2/4/11 5:53 PM:
---------------------------------------------------------------

Hi Robert,

it is good to see we are going in the same direction ;o). Here is a short paper about an extension
of skip list for block based inverted file [1] which was accepted at the European Conference
of Information Retrieval. I shared this paper in a previous discussion with Michael. Maybe
you will find some ideas inside that are worth keeping in mind.

Also, the main problem of the current skip list implementation when it is applied on Sep codec
is in my opinion the fact that we have to store for each skip list entry the fp of each of
the sep file. And more you have files (in the SIREn case, we had 5 files), more the skip list
data structure get bigger.
However, if you think of it, the main goal of the skip list is to skip doc id, not freq or
pos. We are storing the fps of the freq and pos files only because we need to synchronise
the position of the "inverted file pointer" on each file. 
Also, this occurs some overhead when you only need to answer
- pure boolean query: we only need to scan the doc file, but we are still reading and decoding
the fp pointers of the freq and pos file;
- extended boolean query: we only need to scan the doc and freq file, but we are still reading
and decoding the fp pointers of the pos file;

An idea I had a few months ago (but never found the time to implement it and test it) was
to change the way the skip list data structure is created. The idea was to store the pointer
to the doc file in the skip entry, and nothing else. The other pointers (to the freq file
and position file) are in fact stored into the block header.
When using the skip list, you will traverse the skip list until you find the skip point of
interest, then decode the associated skip entry and get the doc file pointer. The doc file
pointer indicates the beginning of the block that contains the identifiers that you are looking
for.
After reading the block from the disk and load it into memory, you can decode its header,
which contains a pointer to the associated block into the frequency file. Similarly, into
the block header of the frequency file, you have a pointer to the associated block into the
pos file. In fact, you can picture this by a linked list of block. The skip list provides
only the first pointer to the block doc file, then pointers to subsequent blocks are included
into the block headers.
On one hand, this considerably reduce the size of the skip list, since most of the information
are "exported" and encoded into block headers. On the other hand, I am not sure if it reduces
the size of the index, as it just move the data from the skip list to the inverted file. In
addition, I think it makes impossible the delta-encoding of the fp for the freq and pos file.
But they might be other optimisation possible with this data model.

[1] http://dl.dropbox.com/u/1278798/ecir2011-skipblock.pdf

      was (Author: renaud.delbru):
    Hi Robert,

it is good to see we are going in the same direction ;o). Here is a short paper about an extension
of skip list for block based inverted file [1] which was accepted at the European Conference
of Information Retrieval. I shared this paper in a previous discussion with Michael. Maybe
you will find some ideas inside that are worth keeping in mind.

Also, the main problem of the current skip list implementation when it is applied on Sep codec
is in my opinion the fact that we have to store for each skip list entry the fp of each of
the sep file. And more you have files (in the SIREn case, we had 5 files), more the skip list
data structure get bigger.
However, if you think of it, the main goal of the skip list is to skip doc id, not freq or
pos. We are storing the fps of the freq and pos files only because we need to synchronise
the position of the "inverted file pointer" on each file. 
Also, this occurs some overhead when you only need to answer
- pure boolean query: we only need to scan the doc file, but we are still reading and decoding
the fp pointers of the freq and pos file;
- extended boolean query: we only need to scan the doc and freq file, but we are still reading
and decoding the fp pointers of the pos file;
An idea I had a few months ago (but never found the time to implement it and test it) was
to change the way the skip list data structure is created. The idea was to store the pointer
to the doc file in the skip entry, and nothing else. The other pointers (to the freq file
and position file) are in fact stored into the block header.
When using the skip list, you will traverse the skip list until you find the skip point of
interest, then decode the associated skip entry and get the doc file pointer. The doc file
pointer indicates the beginning of the block that contains the identifiers that you are looking
for.
After reading the block from the disk and load it into memory, you can decode its header,
which contains a pointer to the associated block into the frequency file. Similarly, into
the block header of the frequency file, you have a pointer to the associated block into the
pos file. In fact, you can picture this by a linked list of block. The skip list provides
only the first pointer to the block doc file, then pointers to subsequent blocks are included
into the block headers.
On one hand, this considerably reduce the size of the skip list, since most of the information
are "exported" and encoded into block headers. On the other hand, I am not sure if it reduces
the size of the index, as it just move the data from the skip list to the inverted file. In
addition, I think it makes impossible the delta-encoding of the fp for the freq and pos file.
But they might be other optimisation possible with this data model.

[1] http://dl.dropbox.com/u/1278798/ecir2011-skipblock.pdf
  
> Sep codec writes insane amounts of skip data
> --------------------------------------------
>
>                 Key: LUCENE-2905
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2905
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Robert Muir
>             Fix For: Bulk Postings branch
>
>
> Currently, even if we use better compression algorithms via Fixed or Variable Intblock
> encodings, we have problems with both performance and index size versus StandardCodec.
> Consider the following numbers:
> {noformat}
> standard:
> frq: 1,862,174,204 bytes
> prx: 1,146,898,936 bytes
> tib: 541,128,354 bytes
> complete index: 4,321,032,720 bytes
> bulkvint:
> doc: 1,297,215,588 bytes
> frq: 725,060,776 bytes
> pos: 1,163,335,609 bytes
> tib: 729,019,637 bytes
> complete index: 5,180,088,695 bytes
> simple64:
> doc: 1,260,869,240 bytes
> frq: 234,491,576 bytes
> pos: 1,055,024,224 bytes
> skp: 473,293,042 bytes
> tib: 725,928,817 bytes
> complete index: 4,520,488,986 bytes
> {noformat}
> I think there are several reasons for this:
> * Splitting into separate files (e.g. postings into .doc + .freq). 
> * Having to store both a relative delta to the block start, and an offset into the block.
> * In a lot of cases various numbers involved are larger than they should be: e.g. they
are file pointer deltas, but blocksize is fixed...
> Here are some ideas (some are probably stupid) of things we could do to try to fix this:
> Is Sep really necessary? Instead should we make an alternative to Sep, Interleaved? that
interleaves doc and freq blocks (doc,freq,doc,freq) into one file? the concrete impl could
implement skipBlock() for when they only want docdeltas: e.g. for Simple64 blocks on disk
are fixed size so it could just skip N bytes. Fixed Int Block codecs like PFOR and BulkVint
just read their single numBytes header they already have today, and skip numBytes.
> Isn't our skipInterval too low? Most of our codecs are using block sizes such as 64 or
128, so a skipInterval of 16 seems a little overkill.
> Shouldn't skipInterval not even be a final constant in SegmentWriteState, but instead
completely private to the codec?
> For block codecs, doesn't it make sense for them to only support skipping to the start
of a block? Then, their skip pointers dont need to be a combination of delta + upto, because
upto is always zero. What would we have to modify in the bulkpostings api for jump() to work
with this?
> For block codecs, shouldn't skipInterval then be some sort of divisor, based on block
size (maybe by default its 1, meaning we can skip to the start of a every block)
> For codecs like Simple64 that encode fixed length frames, shouldnt we use 'blockid' instead
of file pointer so that we get smaller numbers? e.g. simple64 can do blockid * 8 to get to
the file pointer.
> Going along with the blockid concept, couldnt pointers in the terms dict be blockid deltas
from the index term, instead of fp deltas? This would be smaller numbers and we could compress
this metadata better.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message