lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1591) Enable bzip compression in benchmark
Date Sun, 12 Apr 2009 19:57:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698265#action_12698265
] 

Shai Erera commented on LUCENE-1591:
------------------------------------

Here are some numbers:

* Reading the enwiki bz2 file with CBZip2InputStream, wrapped as a BufferedReader and reading
one line at a time, took *28m*. Unzipping with WinRAR took about *30m* (this also includes
writing the uncompressed data to disk). So in that respect the code does not fall short of
other bunzip tools (at least not WinRAR); see the reading sketch right after this list.
* Before the change, reading the compressed data, parsing it and writing it to a compressed
one-line file took 7h (3.1M documents were read). After the change (wrapping the output with
a BufferedOutputStream and removing the per-document flush()) it took 2h, so there's a
significant improvement here; the write-side sketch further below shows the arrangement.

Overall, I think the performance of the BZIP classes is reasonable. Most of the time spent
in the algorithm is in compressing the data, which is usually done only once. The result is
that the 2.5GB enwiki bz2 file is converted to a 2.31GB compressed one-line file (8.5GB of
uncompressed content).
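
For the write side (producing the compressed one-line file), a sketch of the buffered
arrangement mentioned above could look like the following. This is only an illustration under
my assumptions: the file name is a placeholder, the exact nesting of the BufferedOutputStream
in the actual patch may differ, and Ant's CBZip2OutputStream does not write the "BZ" magic
bytes itself, so the caller has to.

{code:java}
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.tools.bzip2.CBZip2OutputStream;

public class BzipWriteSketch {
  public static void main(String[] args) throws IOException {
    // Placeholder output path.
    OutputStream fileOut = new FileOutputStream("enwiki-one-line.txt.bz2");
    // Ant's CBZip2OutputStream does not emit the "BZ" header itself;
    // write it so the result is a standard .bz2 file.
    fileOut.write('B');
    fileOut.write('Z');
    Writer writer = new OutputStreamWriter(
        new BufferedOutputStream(new CBZip2OutputStream(fileOut)), "UTF-8");
    for (int i = 0; i < 1000; i++) { // stand-in for the real document loop
      writer.write("title\tdate\tbody of the document\n");
      // Note: no writer.flush() per document - flushing after every line
      // was what made the original run take 7h instead of 2h.
    }
    writer.close(); // single flush + finish of the bzip stream at the end
  }
}
{code}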

I compared the time it takes to read 100k lines from the compressed and uncompressed one-line
file: compressed 2.26m vs. uncompressed 1.36m ({color:red}~66% slower{color} when reading the
compressed file). The difference is significant, however I'm not sure how much it matters for
the overall process (i.e., reading the documents and indexing them). On my machine it would
take 1.1 hours to read the data, but I'm sure indexing it will take longer, and the indexing
time is the same whether we read the data from a bzip archive or not.

I'll attach the patch shortly, and I think overall this is a good addition. It is off by
default and configurable, so anyone who doesn't care about disk space can always run the
indexing algorithm on an uncompressed one-line file; a hypothetical configuration snippet is
sketched below.
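
For illustration, a benchmark .alg file might then carry something along these lines. The
bzip.compression property name comes from the issue description; the file name and the other
property names/values are only assumptions about a typical contrib/benchmark setup.

{code}
# Hypothetical .alg snippet - only bzip.compression is taken from this issue;
# the docs.file path and doc.maker class are illustrative.
docs.file=temp/enwiki-pages-articles.xml.bz2
bzip.compression=true
doc.maker=org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker
{code}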

> Enable bzip compression in benchmark
> ------------------------------------
>
>                 Key: LUCENE-1591
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1591
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>         Attachments: ant-1.7.1.jar, LUCENE-1591.patch
>
>
> bzip compression can aid the benchmark package by not requiring bzip files (such as enwiki)
to be extracted before they can be indexed. The plan is to add a config parameter
bzip.compression=true/false and, in the relevant tasks, either decompress the input file or
compress the output file using the bzip streams.
> It will add a dependency on ant.jar, which contains two classes, similar to GZIPOutputStream
and GZIPInputStream, that compress/decompress files using the bzip algorithm.
> bzip is known to compress better than the gzip algorithm (~20% better compression), although
it does the compression/decompression a bit slower.
> I will post a patch which adds this parameter and implements it in LineDocMaker, EnwikiDocMaker
and the WriteLineDoc task. Maybe even add the capability to DocMaker or some of the super classes,
so it can be inherited by all sub-classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


