lucene-dev mailing list archives

From "Shai Erera (JIRA)" <>
Subject [jira] Commented: (LUCENE-1591) Enable bzip compression in benchmark
Date Sun, 12 Apr 2009 12:12:14 GMT


Shai Erera commented on LUCENE-1591:

bq. I downloaded the bz2 2008036

I'm almost sure it's a typo, but just to verify - did you download the 20090306 dump (enwiki-20090306-pages-articles.xml.bz2),
or 2008036?

Anyway, I think I've found a problem. The javadocs document that the IS (input stream) version uses
readByte() exclusively, but say nothing about the OS (output stream) version. I read the
code and noticed it always calls the single-byte write() and never uses the array version.
So I wrapped the FOS (FileOutputStream) with a BOS (BufferedOutputStream, bufSize=64k) and then
with the BZOS (the bzip output stream). I did a short test, reading
2000 records from the 20070527 file, before and after the change:

|| Num Docs || Before || After || % time saved ||
| 2000 | 106s | 30s | {color:green}72{color} |
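
To illustrate why the buffering helps, here is a minimal, self-contained sketch. It does not use the real bzip classes from ant.jar: ByteAtATimeStream is a hypothetical stand-in for a stream that only ever calls the single-byte write(), and CountingSink is a hypothetical stand-in for the FileOutputStream that just counts how many calls reach it. All class and method names are made up for this example.

```java
import java.io.BufferedOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class BufferedSinkDemo {

    /** Stand-in for the FileOutputStream: counts how many write calls reach it. */
    static final class CountingSink extends OutputStream {
        long calls = 0;

        @Override
        public void write(int b) {
            calls++;
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            calls++;
        }
    }

    /**
     * Stand-in for the compressing stream. FilterOutputStream forwards every
     * byte through the single-byte write(int) of the wrapped stream and never
     * uses the array version, which mimics the behavior described above.
     */
    static final class ByteAtATimeStream extends FilterOutputStream {
        ByteAtATimeStream(OutputStream out) {
            super(out);
        }
    }

    /** Pushes numBytes through the stand-in and returns the number of calls that reached the sink. */
    static long sinkCalls(int numBytes, boolean buffered) {
        CountingSink sink = new CountingSink();
        // With buffered=true, a 64k BufferedOutputStream sits between the two streams.
        OutputStream target = buffered ? new BufferedOutputStream(sink, 64 * 1024) : sink;
        try (OutputStream out = new ByteAtATimeStream(target)) {
            out.write(new byte[numBytes]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.calls;
    }

    public static void main(String[] args) {
        int n = 1 << 20; // 1 MiB of dummy data
        System.out.println("unbuffered sink calls: " + sinkCalls(n, false));
        System.out.println("buffered sink calls:   " + sinkCalls(n, true));
    }
}
```

Without the buffer, every byte becomes its own call on the sink; with it, the sink only sees a handful of large array writes. On a real FileOutputStream each of those calls is a trip to the OS, which is presumably where the time went.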

I think that if that improvement is stable, then the 9-hour run should drop to ~3 hours,
which seems right. I didn't measure the time it took to unzip the file using WinRAR (the first time
I tried it), but it ran for a couple of hours.

Once the current run completes, I'll kick off a new one with that code change and note
the time difference. I'm eager to see the speedup, but I want to complete a successful run
first :)

> Enable bzip compression in benchmark
> ------------------------------------
>                 Key: LUCENE-1591
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>             Fix For: 2.9
>         Attachments: ant-1.7.1.jar, LUCENE-1591.patch
> bzip compression can aid the benchmark package by removing the need to extract bzip files
> (such as enwiki) in order to index them. The plan is to add a config parameter bzip.compression=true/false
> and, in the relevant tasks, either decompress the input file or compress the output file using
> the bzip streams.
> It will add a dependency on ant.jar, which contains two classes similar to GZIPOutputStream
> and GZIPInputStream that compress/decompress files using the bzip algorithm.
> bzip is known to achieve better compression than the gzip algorithm (~20%
> better compression), although it compresses/decompresses a bit slower.
> I will post a patch which adds this parameter and implements it in LineDocMaker, EnwikiDocMaker
> and the WriteLineDoc task. Maybe even add the capability to DocMaker or some of the super classes,
> so it can be inherited by all sub-classes.
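
For readers who want a concrete picture of the bzip.compression switch described in the issue, here is a rough sketch of the input side. This is not the code from the patch. The Decompressor interface, the class name and the helper methods are hypothetical, and GZIPInputStream is used only as a stand-in so the example runs with the JDK alone; with ant.jar on the classpath the bzip input stream would be plugged in instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CompressionSwitchDemo {

    /** Hypothetical hook: wraps a raw stream in a decompressing one. */
    interface Decompressor {
        InputStream wrap(InputStream raw) throws IOException;
    }

    /** Returns the raw stream unless bzip.compression=true, in which case it is wrapped. */
    static InputStream openInput(InputStream raw, Properties config, Decompressor d)
            throws IOException {
        boolean compressed =
                Boolean.parseBoolean(config.getProperty("bzip.compression", "false"));
        return compressed ? d.wrap(raw) : raw;
    }

    /** Opens the bytes according to the config, reads them fully and decodes them as UTF-8. */
    static String readText(byte[] fileBytes, Properties config, Decompressor d) {
        try (InputStream in = openInput(new ByteArrayInputStream(fileBytes), config, d)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a stand-in "compressed file": the gzip bytes of the given text. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("bzip.compression", "true");
        // GZIPInputStream::new is only the stand-in decompressor for this sketch.
        System.out.println(readText(gzip("one line doc"), config, GZIPInputStream::new));
    }
}
```

Keeping the decision in one place like this is also what would make it easy to push the capability up into a shared super class, as suggested at the end of the description.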

