lucene-dev mailing list archives

From "Michael McCandless (JIRA)"
Subject [jira] Commented: (LUCENE-1591) Enable bzip compression in benchmark
Date Sun, 12 Apr 2009 11:56:14 GMT


Michael McCandless commented on LUCENE-1591:

bq. Did you run EnwikiDocMaker on the actual XML or the bz2 archive?

I downloaded the bz2 2008036 Wikipedia export, ran bunzip2 on the command line, then had to
patch Xerces JAR to get it to parse the XML successfully.

bq. I run the test on my TP 60, which is not a snail-of-a-machine, but definitely not a strong

Hmm -- I wonder how long bunzip2 would take on the TP 60.  Time to upgrade ;)  Get yourself
an X25 SSD!

bq. I would have done that, but the output XML is 17GB, and doing it twice is not an option
on my TP. That's why I wanted this bzip thing in the first place 

Ahh OK :)

> Enable bzip compression in benchmark
> ------------------------------------
>                 Key: LUCENE-1591
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>             Fix For: 2.9
>         Attachments: ant-1.7.1.jar, LUCENE-1591.patch
> bzip compression can aid the benchmark package by removing the need to extract bzip files
(such as enwiki) before indexing them. The plan is to add a config parameter bzip.compression=true/false
and, in the relevant tasks, either decompress the input file or compress the output file using
the bzip streams.
> It will add a dependency on ant.jar which contains two classes similar to GZIPOutputStream
and GZIPInputStream which compress/decompress files using the bzip algorithm.
> bzip is known to compress better than the gzip algorithm (~20% better compression),
although its compression and decompression are a bit slower.
> I will post a patch which adds this parameter and implements it in LineDocMaker, EnwikiDocMaker
and the WriteLineDoc task. Maybe even add the capability to DocMaker or one of the super classes,
so it can be inherited by all sub-classes.
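
As a rough illustration of the idea in the description above (a boolean setting that decides whether a line-per-document file is read or written through a bzip stream), here is a minimal sketch. It is written in Python with the standard-library bz2 module, not in Java against the ant.jar stream classes the issue proposes, and it is not the attached patch. The names open_text, write_line_docs, read_line_docs and the bzip_compression argument are hypothetical stand-ins for the bzip.compression config parameter.

```python
import bz2
import os
import tempfile


def open_text(path, mode="rt", bzip_compression=False, encoding="utf-8"):
    """Open path as text; go through a bzip2 stream when the flag is set.

    bzip_compression is a hypothetical stand-in for the proposed
    bzip.compression=true/false setting.
    """
    if bzip_compression:
        # bz2.open handles compression on write and decompression on read.
        return bz2.open(path, mode, encoding=encoding)
    return open(path, mode, encoding=encoding)


def write_line_docs(path, docs, bzip_compression=False):
    """Write one document per line, optionally compressing the output."""
    with open_text(path, "wt", bzip_compression) as out:
        for doc in docs:
            out.write(doc + "\n")


def read_line_docs(path, bzip_compression=False):
    """Read one document per line, optionally decompressing the input."""
    with open_text(path, "rt", bzip_compression) as src:
        return [line.rstrip("\n") for line in src]


if __name__ == "__main__":
    docs = ["first title\tfirst body", "second title\tsecond body"]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "docs.txt.bz2")
        write_line_docs(path, docs, bzip_compression=True)
        # The file is consumed directly; no separate bunzip2 step is needed.
        assert read_line_docs(path, bzip_compression=True) == docs
```

Keeping the flag check in a single helper mirrors the suggestion to put the capability in a shared super class, so that every reader and writer picks it up without repeating the logic.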

