commons-issues mailing list archives

From "Dawid Weiss (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (COMPRESS-333) bz2 stream decompressor is 10x slower than it could be
Date Wed, 03 Feb 2016 13:42:39 GMT

    [ https://issues.apache.org/jira/browse/COMPRESS-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130396#comment-15130396 ]

Dawid Weiss edited comment on COMPRESS-333 at 2/3/16 1:41 PM:
--------------------------------------------------------------

It's even better than I thought... I wondered why the hell the slowdown happens, so I looked at both
implementations -- they differ, but they're clearly derivatives of the same initial code by
Keiron Liddle.

So... why the slowdown?

Turns out it's the classic buffered vs. unbuffered reads problem. This one-line patch brings
the decompression speed of 7z files on par with my Hadoop fork...
{code}
diff --git a/src/main/java/org/apache/commons/compress/archivers/sevenz/SevenZFile.java b/src/main/java/org/apache/commons/compress/archivers/sevenz/SevenZFile.java
index 398783f..809a9dc 100644
--- a/src/main/java/org/apache/commons/compress/archivers/sevenz/SevenZFile.java
+++ b/src/main/java/org/apache/commons/compress/archivers/sevenz/SevenZFile.java
@@ -17,6 +17,7 @@
  */
 package org.apache.commons.compress.archivers.sevenz;

+import java.io.BufferedInputStream;
 import java.io.ByteArrayInputStream;
 import java.io.Closeable;
 import java.io.DataInput;
@@ -853,8 +854,10 @@ public class SevenZFile implements Closeable {
     private InputStream buildDecoderStack(final Folder folder, final long folderOffset,
                 final int firstPackStreamIndex, SevenZArchiveEntry entry) throws IOException
{
         file.seek(folderOffset);
-        InputStream inputStreamStack = new BoundedRandomAccessFileInputStream(file,
-                archive.packSizes[firstPackStreamIndex]);
+        InputStream inputStreamStack =
+            new BufferedInputStream(
+              new BoundedRandomAccessFileInputStream(file,
+                  archive.packSizes[firstPackStreamIndex]));
         LinkedList<SevenZMethodConfiguration> methods = new LinkedList<SevenZMethodConfiguration>();
         for (final Coder coder : folder.getOrderedCoders()) {
             if (coder.numInStreams != 1 || coder.numOutStreams != 1) {
{code}

RandomAccessFile is in general *very* slow for byte-by-byte access (the same applies to channels),
since each read goes straight to the OS. An in-Java buffer of some sort speeds things up by an
order of magnitude. I'd review the remaining code too, to make sure no direct RAF.read() accesses
take place.
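
To see the effect in isolation, here's a minimal stand-alone sketch (not part of the patch; the
class name is made up) that reads a large file byte by byte, once through a bare RandomAccessFile
and once through a BufferedInputStream. Each unbuffered read() translates to a native call, while
the buffered variant amortizes one call over the whole 8 KiB buffer:
{code}
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.RandomAccessFile;

// Hypothetical micro-benchmark: byte-by-byte reads, unbuffered vs. buffered.
public class ReadBenchmark {
    public static void main(String[] args) throws Exception {
        final String path = args[0]; // any large local file

        long start = System.nanoTime();
        long bytes = 0;
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            while (raf.read() >= 0) { bytes++; } // one native call per byte
        }
        System.out.printf("unbuffered RAF:  %d bytes, %d ms%n",
            bytes, (System.nanoTime() - start) / 1_000_000);

        start = System.nanoTime();
        bytes = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            while (in.read() >= 0) { bytes++; } // served from an 8 KiB buffer
        }
        System.out.printf("buffered stream: %d bytes, %d ms%n",
            bytes, (System.nanoTime() - start) / 1_000_000);
    }
}
{code}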



> bz2 stream decompressor is 10x slower than it could be
> ------------------------------------------------------
>
>                 Key: COMPRESS-333
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-333
>             Project: Commons Compress
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>
> This is related to COMPRESS-291. In short: decompressing 7z archives was an order of magnitude slower in Java than with native tooling.
> My investigation showed that the problematic archive used bz2 streams inside. I then did a quick hack-experiment which took the bz2 decompressor from the Apache Hadoop project (the Java version, not the native one) and substituted it for the default one used by the 7z archiver in Commons for bz2 stream decompression.
> I then ran a quick benchmark on this file:
> {code}
> https://archive.org/download/stackexchange/english.stackexchange.com.7z
> {code}
> The decompression speeds are (SSD, the file was essentially fully cached in memory, so everything is CPU bound):
> {code}
> native {{7za}}: 13 seconds
> Commons (original): 222 seconds
> Commons (patched w/Hadoop bz2): 30 seconds
> {code}
> Yes, it's still 3 times slower than native code, but it's no longer glacially slow...
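> A rough harness along these lines (an illustrative sketch, not the one actually used for these numbers) reproduces the measurement by reading every entry to exhaustion through the SevenZFile API:
> {code}
> import java.io.File;
> import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
> import org.apache.commons.compress.archivers.sevenz.SevenZFile;
>
> // Hypothetical timing harness: decompress every entry, discard the bytes,
> // report wall-clock time. Pass the archive path as args[0].
> public class SevenZBench {
>     public static void main(String[] args) throws Exception {
>         final long start = System.nanoTime();
>         final byte[] buf = new byte[64 * 1024];
>         try (SevenZFile archive = new SevenZFile(new File(args[0]))) {
>             for (SevenZArchiveEntry e = archive.getNextEntry(); e != null;
>                     e = archive.getNextEntry()) {
>                 while (archive.read(buf) >= 0) {
>                     // discard: only decompression speed matters here
>                 }
>             }
>         }
>         System.out.printf("decompressed in %d ms%n",
>             (System.nanoTime() - start) / 1_000_000);
>     }
> }
> {code}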

> My patch is a quick and dirty proof of concept (not committable, see [1]), but it passes the tests. Some notes:
> - Hadoop's stream isn't suited to handling concatenated bz2 streams; it would have to be either patched in the code or (better) decorated at a level above the low-level decoder -- the first sketch after this list illustrates the decoration idea.
> - I only substituted the decompressor in 7z, but obviously this could benefit other places (zip, etc.); essentially, I'd remove BZip2CompressorInputStream entirely.
> - While I toyed around with the above idea I noticed a really annoying thing -- all streams are required to extend {{CompressorInputStream}}, which only adds one method to count the number of consumed bytes. This complicates the code and makes plugging in other InputStream implementations more cumbersome. I could get rid of CompressorInputStream entirely with a few minor changes to the code, but obviously this would be backward incompatible (see [2] and the second sketch after this list).
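> To make the decoration idea concrete, a wrapper along these lines could splice concatenated members back together (a hypothetical sketch -- DecoderFactory and ConcatenatedDecoderStream are made-up names, and it only works if the wrapped decoder consumes no bytes past the end of its member):
> {code}
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.PushbackInputStream;
>
> // Opens a fresh decoder whenever one compressed member ends but the
> // raw stream still has bytes left.
> interface DecoderFactory {
>     InputStream open(InputStream raw) throws IOException;
> }
>
> class ConcatenatedDecoderStream extends InputStream {
>     private final PushbackInputStream raw;
>     private final DecoderFactory factory;
>     private InputStream current;
>
>     ConcatenatedDecoderStream(InputStream raw, DecoderFactory factory) throws IOException {
>         this.raw = new PushbackInputStream(raw, 1);
>         this.factory = factory;
>         this.current = factory.open(this.raw);
>     }
>
>     @Override
>     public int read() throws IOException {
>         int b = current.read();
>         while (b < 0) {                   // current member is exhausted
>             int peek = raw.read();
>             if (peek < 0) {
>                 return -1;                // no more members: real EOF
>             }
>             raw.unread(peek);             // another member follows
>             current = factory.open(raw);
>             b = current.read();
>         }
>         return b;
>     }
> }
> {code}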
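> And the byte-counting concern from the last bullet is trivially expressible as a decorator rather than a base class -- a minimal sketch (hypothetical, not an existing Commons class):
> {code}
> import java.io.FilterInputStream;
> import java.io.IOException;
> import java.io.InputStream;
>
> // Counts bytes pulled through the stream; any InputStream can be wrapped,
> // so decompressors no longer need a common superclass just for counting.
> class CountingInputStream extends FilterInputStream {
>     private long bytesRead;
>
>     CountingInputStream(InputStream in) {
>         super(in);
>     }
>
>     @Override
>     public int read() throws IOException {
>         int b = in.read();
>         if (b >= 0) {
>             bytesRead++;
>         }
>         return b;
>     }
>
>     @Override
>     public int read(byte[] buf, int off, int len) throws IOException {
>         int n = in.read(buf, off, len);
>         if (n > 0) {
>             bytesRead += n;
>         }
>         return n;
>     }
>
>     long getBytesRead() {
>         return bytesRead;
>     }
> }
> {code}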
> References:
> [1] GitHub fork, {{bzip2}} branch: https://github.com/dweiss/commons-compress/tree/bzip2
> [2] Removal and cleanup of CompressorInputStream: https://github.com/dweiss/commons-compress/commit/6948ed371e8ed6e6b69b96ee936d1455cbfd6458


