Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Fri, 5 Oct 2012 18:04:03 +0000 (UTC)
From: "Simon Willnauer (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <755302785.1843.1349460243317.JavaMail.jiratomcat@arcas>
In-Reply-To: <389874577.59186.1342461515785.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Commented] (LUCENE-4226) Efficient compression of small to
 medium stored fields
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470509#comment-13470509 ]

Simon Willnauer commented on LUCENE-4226:
-----------------------------------------

bq. By the way, would it be possible to have one of the Jenkins servers to run lucene-core tests with -Dtests.codec=Compressing for some time?

FYI - http://builds.flonkings.com/job/Lucene-trunk-Linux-Java6-64-test-only-compressed/

> Efficient compression of small to medium stored fields
> ------------------------------------------------------
>
>                 Key: LUCENE-4226
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4226
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>             Fix For: 4.1, 5.0
>
>         Attachments: CompressionBenchmark.java, CompressionBenchmark.java, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, SnappyCompressionAlgorithm.java
>
>
> I've been doing some experiments with stored fields lately. It is very common for an index with stored fields enabled to have most of its space used by the .fdt index file. To prevent this .fdt file from growing too much, one option is to compress stored fields. Although compression works rather well for large fields, this is not the case for small fields and the compression ratio can be very close to 100%, even with efficient compression algorithms.
> In order to improve the compression ratio for small fields, I've written a {{StoredFieldsFormat}} that compresses several documents in a single chunk of data. To see how it behaves in terms of document deserialization speed and compression ratio, I've run several tests with different index compression strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text were indexed and stored):
>  - no compression,
>  - docs compressed with deflate (compression level = 1),
>  - docs compressed with deflate (compression level = 9),
>  - docs compressed with Snappy,
>  - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and chunks of 6 docs,
>  - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and chunks of 6 docs,
>  - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6 docs.
> For those who don't know Snappy, it is a compression algorithm from Google that trades compression ratio for speed: its output is larger than deflate's, but it compresses and decompresses data very quickly.
> {noformat}
> Format           Compression ratio     IndexReader.document time
> ----------------------------------------------------------------
> uncompressed     100%                  100%
> doc/deflate 1     59%                  616%
> doc/deflate 9     58%                  595%
> doc/snappy        80%                  129%
> index/deflate 1   49%                  966%
> index/deflate 9   46%                  938%
> index/snappy      65%                  264%
> {noformat}
> (doc = doc-level compression, index = index-level compression)
> I find it interesting because it makes it possible to trade speed for space (with deflate, the .fdt file shrinks by a factor of 2, much better than with doc-level compression). One other interesting thing is that {{index/snappy}} is almost as compact as {{doc/deflate}} while it is more than 2x faster at retrieving documents from disk.
> These tests have been done on a hot OS cache, which is the worst case for compressed fields (one can expect better results for formats that compress well, since they probably require fewer read/write operations from disk).
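
The chunking idea in the quoted description can be illustrated with a small, self-contained sketch. It is not the code from the attached patches and does not use any Lucene API: the class name, method names, and sample documents are all hypothetical, and it relies only on java.util.zip.Deflater from the JDK. It compares the total compressed size when each small document is deflated on its own with the size when documents are concatenated into chunks (6 docs per chunk, as in the tests above) and each chunk is deflated as a unit.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical illustration only; not the StoredFieldsFormat from the patch.
class ChunkCompressionSketch {

    // Deflates the input at the given level and returns the compressed length in bytes.
    static int deflatedLength(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Doc-level compression: every document is compressed independently.
    static int perDocSize(String[] docs, int level) {
        int total = 0;
        for (String doc : docs) {
            total += deflatedLength(doc.getBytes(StandardCharsets.UTF_8), level);
        }
        return total;
    }

    // Chunk-level compression: chunkSize consecutive documents are
    // concatenated and each chunk is compressed as a single unit.
    static int chunkedSize(String[] docs, int chunkSize, int level) {
        int total = 0;
        for (int start = 0; start < docs.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, docs.length);
            ByteArrayOutputStream chunk = new ByteArrayOutputStream();
            for (int i = start; i < end; i++) {
                byte[] bytes = docs[i].getBytes(StandardCharsets.UTF_8);
                chunk.write(bytes, 0, bytes.length);
            }
            total += deflatedLength(chunk.toByteArray(), level);
        }
        return total;
    }

    // Builds small, similar documents (made-up sample data).
    static String[] sampleDocs(int count) {
        String[] docs = new String[count];
        for (int i = 0; i < count; i++) {
            docs[i] = "title: article " + i
                + " text: stored fields of this document are kept in the .fdt file";
        }
        return docs;
    }

    public static void main(String[] args) {
        String[] docs = sampleDocs(12);
        int raw = 0;
        for (String doc : docs) {
            raw += doc.getBytes(StandardCharsets.UTF_8).length;
        }
        System.out.println("raw bytes:        " + raw);
        System.out.println("doc/deflate 1:    " + perDocSize(docs, 1));
        System.out.println("chunk/deflate 1:  " + chunkedSize(docs, 6, 1));
    }
}
```

The sketch only measures size. A real stored fields format would also need to record where each chunk starts and must decompress a whole chunk to read a single document, which is consistent with the slower IndexReader.document times reported for the chunked formats in the table above.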

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org