lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-4226) Efficient compression of small to medium stored fields
Date Wed, 26 Sep 2012 00:02:07 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4226:
---------------------------------

    Attachment: LUCENE-4226.patch

New version of the patch. It contains a few enhancements:
 - Merge optimization: whenever possible, the StoredFieldsFormat copies compressed data directly instead of decompressing it into a buffer and re-compressing it to an index output.
 - New options for the stored fields index: there are 3 strategies that allow different memory/performance trade-offs:
 ** leaving it fully on disk (same as Lucene40, relying on the O/S cache),
 ** loading the position of the start of the chunk for every document into memory (requires up to 8 * numDocs bytes, no disk access),
 ** loading the position of the start of the chunk and the first doc ID it contains for every chunk (requires up to 12 * numChunks bytes, no disk access, interesting if you have large chunks of compressed data).
 - Improved memory usage and compression ratio (but a little slower) for CompressionMode.FAST (using packed ints).
 - Try to save 1 byte per field by storing the field number and the bits together (a packing sketch follows this list).
 - More tests.
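To illustrate the field-number/bits packing mentioned above, here is a minimal sketch. The 3-bit width and the helper names are my own illustrative assumptions, not necessarily what the patch actually encodes:

{noformat}
// Sketch: write the field number and the per-field type bits as a single
// packed value (e.g. a vLong) instead of two separate values.
// TYPE_BITS = 3 is an assumption, not the patch's actual layout.
static final int TYPE_BITS = 3;

static long pack(int fieldNumber, int bits) {
  assert bits >= 0 && bits < (1 << TYPE_BITS);
  return ((long) fieldNumber << TYPE_BITS) | bits;
}

static int fieldNumber(long packed) {
  return (int) (packed >>> TYPE_BITS);
}

static int bits(long packed) {
  return (int) (packed & ((1 << TYPE_BITS) - 1));
}
{noformat}

Since field numbers are usually small, the packed value typically fits in a single vLong byte, whereas writing the two values separately would take at least two bytes.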

So in the end, this StoredFieldsFormat tries to make disk seeks less likely by:
 - giving the ability to load the stored fields index into memory (you never need to seek to find the position of the chunk that contains your document; a lookup sketch follows this list),
 - reducing the size of the fields data file (.fdt) so that the O/S cache can cache more documents.

Out of curiosity, I tested whether it could be faster for LZ4 to use intermediate buffers for compression and/or decompression, and it is slower than accessing the index input/output directly (at least with MMapDirectory).

I hope I'll have something committable soon.
                
> Efficient compression of small to medium stored fields
> ------------------------------------------------------
>
>                 Key: LUCENE-4226
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4226
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>             Fix For: 4.1, 5.0
>
>         Attachments: CompressionBenchmark.java, CompressionBenchmark.java, LUCENE-4226.patch,
> LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, SnappyCompressionAlgorithm.java
>
>
> I've been doing some experiments with stored fields lately. It is very common for an
> index with stored fields enabled to have most of its space used by the .fdt file. To
> prevent this .fdt file from growing too much, one option is to compress stored fields.
> Although compression works rather well for large fields, this is not the case for small
> fields, whose compression ratio can be very close to 100% (i.e. almost no space saved),
> even with efficient compression algorithms.
> In order to improve the compression ratio for small fields, I've written a {{StoredFieldsFormat}}
> that compresses several documents in a single chunk of data. To see how it behaves in terms
> of document deserialization speed and compression ratio, I've run several tests with different
> index compression strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and
> text were indexed and stored):
>  - no compression,
>  - docs compressed with deflate (compression level = 1),
>  - docs compressed with deflate (compression level = 9),
>  - docs compressed with Snappy,
>  - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and chunks of 6 docs,
>  - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and chunks of 6 docs,
>  - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6 docs.
> For those who don't know Snappy, it is a compression algorithm from Google that does not
> compress as compactly as deflate, but compresses and decompresses data very quickly.
> {noformat}
> Format           Compression ratio     IndexReader.document time
> ----------------------------------------------------------------
> uncompressed     100%                  100%
> doc/deflate 1     59%                  616%
> doc/deflate 9     58%                  595%
> doc/snappy        80%                  129%
> index/deflate 1   49%                  966%
> index/deflate 9   46%                  938%
> index/snappy      65%                  264%
> {noformat}
> (doc = doc-level compression, index = index-level compression)
> I find it interesting because it makes it possible to trade speed for space (with deflate,
> the .fdt file shrinks by a factor of 2, much better than with doc-level compression). One
> other interesting thing is that {{index/snappy}} is almost as compact as {{doc/deflate}}
> while being more than 2x faster at retrieving documents from disk.
> These tests have been done on a hot OS cache, which is the worst case for compressed
> fields (one can expect better results for formats that compress well, since they probably
> require fewer read/write operations from disk).
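To make the chunking idea from the quoted description concrete, here is a toy sketch using JDK deflate; the class, method names, and structure are illustrative assumptions, not the patch's actual format or APIs:

{noformat}
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Toy sketch: serialize several small documents into one buffer and compress
// the whole buffer at once, so that they share a compression context and the
// per-document overhead shrinks. Names are illustrative, not the patch's.
class ChunkedWriter {
  private final int chunkDocs; // e.g. 6 docs per chunk, as in the benchmark
  private final ByteArrayOutputStream docs = new ByteArrayOutputStream();
  private final List<Integer> docLengths = new ArrayList<Integer>();

  ChunkedWriter(int chunkDocs) { this.chunkDocs = chunkDocs; }

  void addDocument(byte[] serializedDoc) {
    docs.write(serializedDoc, 0, serializedDoc.length);
    docLengths.add(serializedDoc.length);
    if (docLengths.size() == chunkDocs) flushChunk();
  }

  private void flushChunk() {
    Deflater deflater = new Deflater(1); // level 1, the fast setting above
    deflater.setInput(docs.toByteArray());
    deflater.finish();
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    byte[] tmp = new byte[4096];
    while (!deflater.finished()) {
      compressed.write(tmp, 0, deflater.deflate(tmp));
    }
    deflater.end();
    // A real format would now write the doc lengths and the compressed bytes
    // to the .fdt file; retrieving one doc means decompressing the chunk and
    // slicing out the right range using the stored lengths.
    docs.reset();
    docLengths.clear();
  }
}
{noformat}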

