lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1591) Enable bzip compression in benchmark
Date Sun, 12 Apr 2009 11:31:15 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698211#action_12698211
] 

Shai Erera commented on LUCENE-1591:
------------------------------------

This is how I wrap the FileInputStream (FIS) with bzip:

{code}
      if (doBzipCompression) {
        // According to CBZip2InputStream's documentation, we should first
        // consume the first two file header chars ('B' and 'Z'), as well as 
        // wrap the underlying stream with a BufferedInputStream, since CBZip2IS
        // uses the read() method exclusively.
        fileIS = new BufferedInputStream(fileIS, READER_BUFFER_BYTES);
        fileIS.read(); fileIS.read();
        fileIS = new CBZip2InputStream(fileIS);
      }
{code}
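For reference, the two-byte header consumption above can be made defensive by peeking before consuming. This is a minimal sketch using only the JDK's PushbackInputStream; it assumes the bzip2 magic bytes 'B' and 'Z' described in CBZip2InputStream's documentation, and the ant.jar wrapping itself is not shown:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class BzipHeader {

  /**
   * Peeks at the first two bytes of the stream. If they are the bzip2
   * magic ('B', 'Z'), they are consumed, which is what CBZip2InputStream
   * expects; otherwise they are pushed back and the stream is untouched.
   */
  public static boolean consumeBzipHeader(PushbackInputStream in) throws IOException {
    int b1 = in.read();
    int b2 = in.read();
    if (b1 == 'B' && b2 == 'Z') {
      return true; // header consumed; stream is ready for CBZip2InputStream
    }
    // Not bzip2 -- push the bytes back in reverse order so reads see them again.
    if (b2 != -1) in.unread(b2);
    if (b1 != -1) in.unread(b1);
    return false;
  }
}
```

A caller would construct the PushbackInputStream with a pushback buffer of at least 2 bytes.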

bq. Is it possible your bunzipping code is messing up the XML?

I successfully read the file and re-compressed it with Java's GZIP classes, but I did not
attempt to parse the XML itself. Did you run EnwikiDocMaker on the actual XML or on the bz2 archive?
The 20070527 run should end soon (I hope - it has reached 2.2M documents), so if it doesn't fail,
I'd guess the bzip wrapping is very unlikely to affect the XML parsing.
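That sanity check - reading the decompressed data and re-compressing it with Java's GZIP classes - amounts to a round trip like this minimal sketch, using only java.util.zip (the helper names here are illustrative, not from the patch):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {

  /** Compresses the given bytes with GZIP. */
  public static byte[] gzip(byte[] data) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(data);
    }
    return bos.toByteArray();
  }

  /** Decompresses GZIP bytes back to the original data. */
  public static byte[] gunzip(byte[] data) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = gz.read(buf)) != -1) {
        bos.write(buf, 0, n);
      }
    }
    return bos.toByteArray();
  }
}
```

If the round trip reproduces the input byte-for-byte, the decompression side is not corrupting the data.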

bq. Shai, why did it take 9 hours to get to that exception? Is bunzip that slow? That seems
crazy. 

I ran the test on my TP 60, which is not a snail of a machine, but definitely not a strong
server either. You can download the patch and the jar and try it out on your machine.
But yes, I did notice bzip is very slow compared to gzip; it does, however, achieve a better
compression ratio. I do want to measure the times, to give more accurate numbers, but to do
that I first need to finish a successful run.
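Measuring those times can be as simple as wrapping the decompression loop with System.nanoTime(). A hedged sketch (the helper is hypothetical; the stream passed in would be the wrapped bzip or gzip input stream):

```java
import java.io.IOException;
import java.io.InputStream;

public class StreamTimer {

  /** Drains the stream and returns { totalBytesRead, elapsedMillis }. */
  public static long[] drainTimed(InputStream in) throws IOException {
    long start = System.nanoTime();
    byte[] buf = new byte[1 << 16];
    long total = 0;
    int n;
    while ((n = in.read(buf)) != -1) {
      total += n;
    }
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    return new long[] { total, elapsedMs };
  }
}
```

Running this once over the bzip-wrapped stream and once over a gzip-wrapped stream of the same data would give comparable throughput numbers.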

bq. Can you run only your bunzip code and confirm it ...

I would have done that, but the output XML is 17 GB, and processing it twice is not an option on
my TP. That's why I wanted this bzip support in the first place :)
I'll try that with the 20070527 version, which will hopefully be ~half the size ...



> Enable bzip compression in benchmark
> ------------------------------------
>
>                 Key: LUCENE-1591
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1591
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>         Attachments: ant-1.7.1.jar, LUCENE-1591.patch
>
>
> bzip compression can aid the benchmark package by not requiring extracting bzip files
> (such as enwiki) in order to index them. The plan is to add a config parameter bzip.compression=true/false
> and in the relevant tasks either decompress the input file or compress the output file using
> the bzip streams.
> It will add a dependency on ant.jar, which contains two classes similar to GZIPOutputStream
> and GZIPInputStream which compress/decompress files using the bzip algorithm.
> bzip is known to be superior in its compression performance to the gzip algorithm (~20%
> better compression), although it does the compression/decompression a bit slower.
> I will post a patch which adds this parameter and implements it in LineDocMaker, EnwikiDocMaker
> and the WriteLineDoc task. Maybe even add the capability to DocMaker or some of its superclasses,
> so it can be inherited by all subclasses.
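A minimal sketch of how the proposed bzip.compression switch could be read (the property name is from the description above; plain java.util.Properties stands in for benchmark's own config class):

```java
import java.util.Properties;

public class BzipConfig {

  /** Reads the proposed bzip.compression flag, defaulting to false. */
  public static boolean doBzipCompression(Properties props) {
    return Boolean.parseBoolean(props.getProperty("bzip.compression", "false"));
  }
}
```

A task would consult this flag once, then wrap its input or output stream with the bzip streams from ant.jar only when it is true.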

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


