lucene-dev mailing list archives

From "Shai Erera (JIRA)" <>
Subject [jira] Updated: (LUCENE-1591) Enable bzip compression in benchmark
Date Sun, 12 Apr 2009 20:07:14 GMT


Shai Erera updated LUCENE-1591:

    Attachment: LUCENE-1591.patch

Patch includes:
* Wrapping the FileOutputStream with a BufferedOutputStream.
* Removing the calls to flush().
* Enhancement to EnwikiDocMaker's startElement and endElement: instead of calling String.equals
on the qualified name and comparing it against 5 different strings, I added a static map from String
to Integer and a static method getElementType which returns an int. I then changed those methods
to do a 'switch' on the type. I haven't measured the perf. gain, but I expect it to
improve things.
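
For illustration, the map-plus-switch idea from the last bullet could look roughly like the sketch below. This is not the code from the patch: the class name, the constants, and the five element names are assumptions made for the example; only the getElementType name and the String-to-Integer map come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not taken from LUCENE-1591.patch.
final class ElementTypes {
    // Assumed element kinds; the message only says there are 5 strings.
    static final int TITLE = 0;
    static final int DATE = 1;
    static final int BODY = 2;
    static final int ID = 3;
    static final int PAGE = 4;
    static final int UNKNOWN = -1;

    // Static lookup table, built once, from qualified name to int type.
    private static final Map<String, Integer> ELEMENTS = new HashMap<String, Integer>();
    static {
        ELEMENTS.put("title", Integer.valueOf(TITLE));
        ELEMENTS.put("timestamp", Integer.valueOf(DATE));
        ELEMENTS.put("text", Integer.valueOf(BODY));
        ELEMENTS.put("id", Integer.valueOf(ID));
        ELEMENTS.put("page", Integer.valueOf(PAGE));
    }

    // One hash lookup replaces a chain of String.equals calls.
    static int getElementType(String qName) {
        Integer type = ELEMENTS.get(qName);
        return type == null ? UNKNOWN : type.intValue();
    }

    // Stand-in for what startElement/endElement would do per element.
    static String handle(String qName) {
        switch (getElementType(qName)) {
            case TITLE: return "collect title";
            case DATE:  return "collect date";
            case BODY:  return "collect body";
            case ID:    return "collect id";
            case PAGE:  return "start new page";
            default:    return "ignore";
        }
    }

    public static void main(String[] args) {
        System.out.println(handle("title"));
        System.out.println(handle("redirect"));
    }
}
```

Whether this is measurably faster than five equals calls depends on the input, which matches the note above that the gain has not been measured.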

There is an open question regarding the ant-1.7.1.jar dependency. Uwe mentioned the Commons
Compress project, which handles the bzip format (as well as others). I took a look and found
no place to download a jar, and it looks like a 'young' project with very little
documentation. This is not to say the code is of low quality or not to be trusted; it's just
that I prefer the ant dependency, at least until this project matures enough. And anyway I
guess everyone who uses Lucene has Ant on their system, so this doesn't look like a major dependency.

However, if you think otherwise, then we should get a jar from there (checking out the code
and building it manually is the only way I see, but please correct me if I'm wrong), adapt
the code to use it, rerun the perf. measurements, etc.

> Enable bzip compression in benchmark
> ------------------------------------
>                 Key: LUCENE-1591
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>             Fix For: 2.9
>         Attachments: ant-1.7.1.jar, LUCENE-1591.patch, LUCENE-1591.patch
> bzip compression can aid the benchmark package by not requiring extracting bzip files
> (such as enwiki) in order to index them. The plan is to add a config parameter bzip.compression=true/false
> and in the relevant tasks either decompress the input file or compress the output file using
> the bzip streams.
> It will add a dependency on ant.jar which contains two classes similar to GZIPOutputStream
> and GZIPInputStream which compress/decompress files using the bzip algorithm.
> bzip is known to be superior in its compression performance to the gzip algorithm (~20%
> better compression), although it does the compression/decompression a bit slower.
> I will post a patch which adds this parameter and implements it in LineDocMaker, EnwikiDocMaker
> and the WriteLineDoc task. Maybe even add the capability to DocMaker or one of the super classes,
> so it can be inherited by all sub-classes.
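
As a rough sketch of the plan in the quoted description, the stream selection might be wired as below. Because the bzip classes live in ant.jar and are described as similar to GZIPOutputStream and GZIPInputStream, this example uses the JDK's gzip streams as a stand-in so that it runs without extra jars; the class and method names are hypothetical, and the boolean stands for the value of the bzip.compression parameter. The file stream is buffered before it is wrapped, and close() is relied on to flush.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; gzip stands in for the bzip streams from ant.jar.
final class CompressedStreams {
    // 'compress' plays the role of the bzip.compression config value.
    static OutputStream openOutput(File file, boolean compress) throws IOException {
        OutputStream out = new BufferedOutputStream(new FileOutputStream(file));
        return compress ? new GZIPOutputStream(out) : out;
    }

    static InputStream openInput(File file, boolean compress) throws IOException {
        InputStream in = new BufferedInputStream(new FileInputStream(file));
        return compress ? new GZIPInputStream(in) : in;
    }

    // Writes one line to a temp file and reads it back through the same setting.
    static String roundTrip(String line, boolean compress) {
        try {
            File tmp = File.createTempFile("linedoc", compress ? ".gz" : ".txt");
            try {
                OutputStream out = openOutput(tmp, compress);
                try {
                    out.write(line.getBytes(StandardCharsets.UTF_8));
                } finally {
                    out.close(); // close() flushes; no explicit flush() needed
                }
                InputStream in = openInput(tmp, compress);
                try {
                    ByteArrayOutputStream buf = new ByteArrayOutputStream();
                    byte[] chunk = new byte[4096];
                    int n;
                    while ((n = in.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                    return new String(buf.toByteArray(), StandardCharsets.UTF_8);
                } finally {
                    in.close();
                }
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling CompressedStreams.roundTrip("some line", true) and CompressedStreams.roundTrip("some line", false) should both return the original line, which is the property the doc makers and the WriteLineDoc task would depend on.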

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
