lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-4226) Efficient compression of small to medium stored fields
Date Wed, 29 Aug 2012 00:02:07 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4226:
---------------------------------

    Attachment: CompressionBenchmark.java
                LUCENE-4226.patch

New patch as well as the code I used to benchmark.

Documents are still compressed into chunks, but I removed the ability to select the
compression algorithm on a per-field basis, both to make the patch simpler and to support
cross-field compression.

I also added an index in front of the compressed data, encoded with packed ints, so that
decompressors can stop as soon as enough data has been decompressed.
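
To make the idea concrete, here is a minimal sketch (illustrative names only, not the
patch's actual code; a plain int[] stands in for the packed ints, and decompress() is
assumed to be able to stop after producing a given number of bytes):

{noformat}
import java.util.Arrays;

// Per-document lengths are stored in front of the compressed bytes, so a
// reader can stop decompressing as soon as the requested document is decoded.
abstract class ChunkReaderSketch {

  /** Decompress and return the first {@code upTo} bytes of {@code compressed}. */
  abstract byte[] decompress(byte[] compressed, int upTo);

  /** Return the serialized bytes of document {@code docInChunk} only. */
  byte[] document(int docInChunk, int[] docLengths, byte[] compressed) {
    int start = 0;
    for (int i = 0; i < docInChunk; i++) {
      start += docLengths[i];                  // offset of the wanted document
    }
    int end = start + docLengths[docInChunk];  // no need to decompress further
    byte[] prefix = decompress(compressed, end);
    return Arrays.copyOfRange(prefix, start, end);
  }
}
{noformat}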

The JDK only includes a moderately fast compression algorithm (deflate), but for this kind
of use case we are more interested in algorithms with fast compression and decompression,
such as LZ4 (http://code.google.com/p/lz4/) or Snappy (http://code.google.com/p/snappy/).
Since lucene-core has no external dependencies, I ported LZ4 to Java (included in the patch,
see o.a.l.util.compress).

LZ4 has a very fast decompressor and two compression modes:
 - fast scan, which looks for the last offset in the stream that shares at least 4 bytes
with the current position (using a hash table) and emits a reference to it (see the sketch
after this list),
 - high compression, which considers the last 256 offsets in the stream that share at least
4 bytes, picks the one with the longest common sequence, and then performs trade-offs
between overlapping matches in order to improve the compression ratio.
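
As a rough illustration of the fast-scan mode, here is a simplified sketch of the match
finder (this is not the ported o.a.l.util.compress code, and it omits details such as
LZ4's 64kb offset window):

{noformat}
// A hash table remembers the last offset at which each 4-byte sequence was
// seen; a reference is emitted when the bytes at the current offset match
// that earlier occurrence.
class Lz4ScanSketch {
  private final int[] hashTable = new int[1 << 16]; // stores offset + 1, 0 = empty

  private static int hash(byte[] b, int i) {
    int v = (b[i] & 0xFF) | (b[i + 1] & 0xFF) << 8
          | (b[i + 2] & 0xFF) << 16 | (b[i + 3] & 0xFF) << 24;
    return (v * -1640531535) >>> 16; // multiplicative hash into a 16-bit bucket
  }

  /** Last offset whose 4 bytes match those at {@code i}, or -1 (requires i + 3 < buf.length). */
  int findMatch(byte[] buf, int i) {
    int h = hash(buf, i);
    int candidate = hashTable[h] - 1;
    hashTable[h] = i + 1;
    if (candidate >= 0
        && buf[candidate] == buf[i] && buf[candidate + 1] == buf[i + 1]
        && buf[candidate + 2] == buf[i + 2] && buf[candidate + 3] == buf[i + 3]) {
      return candidate; // verified 4-byte match: emit an (offset, length) reference
    }
    return -1; // no match: the byte at i is emitted as a literal
  }
}
{noformat}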

(In case you are curious about LZ4, I benchmarked it against other compression algorithms in
http://blog.jpountz.net/post/28092106032/wow-lz4-is-fast; unfortunately the high-compression
Java impl is not included in that benchmark.)

I ran a benchmark similar to the one for my first patch, but this time I only compressed and
stored the 1kb text field (the title field was too small to be fair to document-level
compression with deflate). Here are the results:

{noformat}
Format           Chunk size  Compression ratio     IndexReader.document time
-----------------------------------------------------------------------------
uncompressed                               100%                         100%
doc/deflate 1                               58%                         579%
doc/deflate 9                               57%                         577%
index/deflate 1          4K                 50%                        1057%
index/deflate 9          4K                 48%                        1037%
index/lz4 scan           4K                 70%                         329%
index/lz4 hc             4K                 66%                         321%
index/deflate 1           1                 60%                         457%
index/deflate 9           1                 59%                         454%
index/lz4 scan            1                 81%                         171%
index/lz4 hc              1                 79%                         176%
{noformat}

NOTE: chunk size = 1 means that there was only one document per chunk (there is a
compress+flush every time the accumulated byte size of buffered documents is >= the chunk
size).
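
For illustration, a minimal sketch of that flush rule (hypothetical names, not the patch's
actual writer):

{noformat}
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Documents are buffered, and the pending chunk is compressed and flushed as
// soon as its byte size reaches the configured chunk size.
abstract class ChunkedWriterSketch {
  private final int chunkSize;
  private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
  private int pendingDocs = 0;

  ChunkedWriterSketch(int chunkSize) {
    this.chunkSize = chunkSize;
  }

  /** Compress {@code uncompressed} and write it as one chunk of {@code docCount} docs. */
  abstract void writeChunk(int docCount, byte[] uncompressed) throws IOException;

  void addDocument(byte[] serializedDoc) throws IOException {
    pending.write(serializedDoc);
    pendingDocs++;
    if (pending.size() >= chunkSize) { // with chunkSize = 1, every document
      writeChunk(pendingDocs, pending.toByteArray()); // gets its own chunk
      pending.reset();
      pendingDocs = 0;
    }
  }
}
{noformat}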

NOTE: these numbers have been computed with the whole index fitting in the I/O cache. The
results should shift further in favor of the compressing formats as soon as the index no
longer fits in the I/O cache.

There are still a few nocommits in the patch, but it should be easy to get rid of them. I'd
be very happy to have some feedback. :-)
                
> Efficient compression of small to medium stored fields
> ------------------------------------------------------
>
>                 Key: LUCENE-4226
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4226
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Adrien Grand
>            Priority: Trivial
>         Attachments: CompressionBenchmark.java, CompressionBenchmark.java, LUCENE-4226.patch, LUCENE-4226.patch, SnappyCompressionAlgorithm.java
>
>
> I've been doing some experiments with stored fields lately. It is very common for an
> index with stored fields enabled to have most of its space used by the .fdt index file.
> To prevent this .fdt file from growing too much, one option is to compress stored fields.
> Although compression works rather well for large fields, this is not the case for small
> fields and the compression ratio can be very close to 100%, even with efficient
> compression algorithms.
> In order to improve the compression ratio for small fields, I've written a
> {{StoredFieldsFormat}} that compresses several documents in a single chunk of data. To see
> how it behaves in terms of document deserialization speed and compression ratio, I've run
> several tests with different index compression strategies on 100,000 docs from Mike's 1K
> Wikipedia articles (title and text were indexed and stored):
>  - no compression,
>  - docs compressed with deflate (compression level = 1),
>  - docs compressed with deflate (compression level = 9),
>  - docs compressed with Snappy,
>  - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and chunks of 6 docs,
>  - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and chunks of 6 docs,
>  - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6 docs.
> For those who don't know Snappy, it is a compression algorithm from Google that does not
> aim for the highest compression ratios, but compresses and decompresses data very quickly.
> {noformat}
> Format           Compression ratio     IndexReader.document time
> ----------------------------------------------------------------
> uncompressed     100%                  100%
> doc/deflate 1     59%                  616%
> doc/deflate 9     58%                  595%
> doc/snappy        80%                  129%
> index/deflate 1   49%                  966%
> index/deflate 9   46%                  938%
> index/snappy      65%                  264%
> {noformat}
> (doc = doc-level compression, index = index-level compression)
> I find this interesting because it makes it possible to trade speed for space (with
> deflate, the .fdt file shrinks by a factor of 2, much better than with doc-level
> compression). Another interesting observation is that {{index/snappy}} is almost as
> compact as {{doc/deflate}} while being more than 2x faster at retrieving documents from
> disk.
> These tests have been run with a hot OS cache, which is the worst case for compressed
> fields (one can expect better results for formats that have a high compression ratio,
> since they probably require fewer read/write operations on disk).



