lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Renaud Delbru (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2340) FixedIntBlockIndexOutput encodes unnecessary integers at the end of a list
Date Tue, 23 Mar 2010 01:07:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848484#action_12848484
] 

Renaud Delbru commented on LUCENE-2340:
---------------------------------------

{quote}
The block is "shared" across postings, so a rare posting list in an otherwise big segment
should be fine?
{quote}

Ok, I didn't got that point until now. Small lists will be inlined with bigger lists over
a block.
However, it means that, most of the time after a seek, you will have to decode a block when
only a part of it is of interest (e.g., decode the block, skip the end of the previous posting
list, in order to access the start of the posting list you are looking for). But this overhead
stays minimal since it will occur only once per query term times the number of posting files.
And this overhead is even reduced more if the block compression algorithm is easily "seekable"
like for FOR.
However, I am not sure about the consequences on algorithms like FOR or PFOR which determine
the best compression configuration for a block given its list of integers. Certain blocks
could be less well compressed since its list of integers could follow more than one distribution.
It could be something interesting to try (or are you aware of any kind of experiments or benchmarks
that show the benefits/disadvantages of these two approaches ?).

{quote}
Small segments will indeed be wasteful, but they'll presumably quickly be merged away.
{quote}

Yes, they will probably disappear after the first merge.

{quote}
Would other less-silly impls also need to do this? Ie the thing I want to avoid is foisting
onto all block-based codecs the need to track the size of every block...
{quote}

They would have to do something similar, but a clever implementation could reduce well the
overhead.

For example, the core algorithm of Simple9, FOR, PFOR are completely compatible. The only
changes I had to made is in *IndexInput#readBlock and *IndexOutput#flushBlock.
I set the lower bit of numBytes to 0 when the block to write does not have the same size than
the previous block. and encode its length (in IndexOutput). Then, in IndexInput, whenever
I read a numBytes with the lower bit to 0, I read the length of the block, and pass it to
#setUnCompressedData. Since current block algorithms relies on the size of the uncompressed
data to terminate their loop, they adapt automatically to the new block size.

But, indeed, after your comments, the gain seems very minimal, since an incomplete block will
only happen once per segment file. The technique I am using was for the case where a block
is not "shared" with other posting list.
But, in the current case, this is not very useful and will add a slight overhead for very
large segment file (since I need one bit per block * number of blocks for tracking the end
of the block list).

Thanks for the clarification Michael, and I think you can close the issue, since I am not
really sure that my patch is worth keeping given the current way you implemented block-based
index.

> FixedIntBlockIndexOutput encodes unnecessary integers at the end of a list
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-2340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2340
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: Flex Branch
>            Reporter: Renaud Delbru
>            Priority: Minor
>             Fix For: Flex Branch
>
>         Attachments: LUCENE-1458-FixedIntBlockIndexOutput.patch, LUCENE-1458-FixedIntBlockIndexOutput.patch
>
>
> At closing time, the current FixedIntBlockIndexOutput flushes blocks of blockSize even
if there is only a few integers in the block.
> This can be problematic and causes a big overhead when using large blockSize (e.g., 1024),
on small segments or on rare term posting list. 
> One solution will be to have a secondary flushBlock method with an additional paramter:
the valid length of a buffer. This method will be only called in the FixedIntBlockIndexOutput#close()
method.
> The way this particular block of integers are encoded are left to subclasses.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message