lucene-dev mailing list archives

From "Shai Erera (JIRA)" <>
Subject [jira] Commented: (LUCENE-1539) Improve Benchmark
Date Wed, 08 Apr 2009 15:25:12 GMT


Shai Erera commented on LUCENE-1539:

Would it also be worthwhile to add extensions to EnwikiDocMaker, WriteLineDoc and LineDocMaker
that can read/write the content in bzip format?
I downloaded the latest Enwiki dump, 4.5 GB in bzip format. The extracted XML is 17GB. I
thought to myself that I have no real reason to extract it - I can read the content directly
from the bzip-compressed file.

So I looked around and found that ant.jar contains two classes which can read/write
that format. Just to compare, I gzipped the XML file and the result was a 5.1GB file (~13% larger
than the bzip file). General measurements on the web also show that bzip compresses better than
gzip, although it probably runs a bit slower.

I then ran the WriteLineDoc task, to produce the one-line-per-document text file, and stopped
when it reached 228MB. Again, I zipped, gzipped and bzipped the file, and the bzip format was
smaller by ~20%.

So I was wondering - besides the speed of reading from a compressed archive, which is slower
than reading from a plain XML or TXT file, is there a reason why we don't use bzip/gzip when
reading content? It would save a lot of space, and I'm not sure that the reading part of indexing
is what matters most.
However, I'm aware that some people might prefer to read from plain files, so I suggest
we just add extensions which can read/write the compressed format.
The question is, assuming you agree, should we use bzip (which requires an external library)
or gzip, which is in the JDK and does not compress as well as bzip, but might perform better?
(I can take some measurements if needed, but my main question is whether we want
to introduce a dependency on another library.)
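
To make the gzip option concrete, here is a minimal sketch of writing and reading a
one-line-per-document file through the JDK's GZIPOutputStream/GZIPInputStream. It is not
the benchmark code: the class name GzipLineDocs, its two helper methods and the
tab-separated sample lines are hypothetical, chosen only for illustration, and the code
uses present-day Java syntax (try-with-resources). A bzip variant would presumably follow
the same stream-wrapping pattern with the classes from ant.jar in place of the JDK ones.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of contrib/benchmark.
class GzipLineDocs {

    // Writes one document per line into a gzip-compressed file.
    static void writeLines(Path file, List<String> docs) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8))) {
            for (String doc : docs) {
                out.write(doc);
                out.write('\n');
            }
        }
    }

    // Reads the documents back, decompressing on the fly; the plain file is never
    // materialized on disk.
    static List<String> readLines(Path file) throws IOException {
        List<String> docs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                docs.add(line);
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("linedocs", ".txt.gz");
        // Sample lines are made up; the field layout here is an assumption.
        writeLines(tmp, Arrays.asList("title1\tdate1\tbody1", "title2\tdate2\tbody2"));
        System.out.println(readLines(tmp));
        Files.delete(tmp);
    }
}
```

Because the compression lives entirely in the stream wrapper, a doc maker could in
principle choose between plain and compressed input by file extension without changing
its line-parsing logic.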

If this belongs in a separate issue, let me know.

> Improve Benchmark
> -----------------
>                 Key: LUCENE-1539
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch,,
>   Original Estimate: 336h
>  Remaining Estimate: 336h
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
