Return-Path:
X-Original-To: apmail-commons-issues-archive@minotaur.apache.org
Delivered-To: apmail-commons-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D6621184C3 for ; Wed, 3 Feb 2016 13:42:50 +0000 (UTC)
Received: (qmail 502 invoked by uid 500); 3 Feb 2016 13:42:40 -0000
Delivered-To: apmail-commons-issues-archive@commons.apache.org
Received: (qmail 131 invoked by uid 500); 3 Feb 2016 13:42:40 -0000
Mailing-List: contact issues-help@commons.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: issues@commons.apache.org
Delivered-To: mailing list issues@commons.apache.org
Received: (qmail 99852 invoked by uid 99); 3 Feb 2016 13:42:39 -0000
Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 03 Feb 2016 13:42:39 +0000
Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id CD0842C1F58 for ; Wed, 3 Feb 2016 13:42:39 +0000 (UTC)
Date: Wed, 3 Feb 2016 13:42:39 +0000 (UTC)
From: "Dawid Weiss (JIRA)"
To: issues@commons.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Updated] (COMPRESS-333) bz2 stream decompressor is 10x slower than it could be
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/COMPRESS-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated COMPRESS-333:
---------------------------------

    Description:
This is related to COMPRESS-291. In short: decompressing 7z archives was an order of magnitude slower in Java than with native tooling. My investigation showed that the problematic archive used bz2 streams inside.
I then did a quick hack-experiment which took the bz2 decompressor from the Apache Hadoop project (the Java version, not the native one) and replaced the default one used for bz2 stream decompression of the 7z archiver in commons. I then ran a quick benchmark on this file:

{code}
https://archive.org/download/stackexchange/english.stackexchange.com.7z
{code}

The decompression speeds are (SSD, the file was essentially fully cached in memory, so everything is CPU bound):

{code}
native {{7za}}:                           13 seconds
Commons (original):                      222 seconds
Commons (patched w/Hadoop bz2):           30 seconds
Commons (patched w/BufferedInputStream):  28 seconds
{code}

Yes, it's still 3 times slower than native code, but it's no longer glacially slow... My patch is a quick and dirty proof of concept (not committable, see [1]), but it passes the tests. Some notes:

- Hadoop's stream isn't suited for handling concatenated bz2 streams; it'd have to be either patched in the code or (better) decorated at a level above the low-level decoder,
- I only substituted the decompressor in 7z, but obviously this could benefit other places (zip, etc.); essentially, I'd remove BZip2CompressorInputStream entirely.
- While I toyed around with the above idea I noticed a really annoying thing: all streams are required to extend {{CompressorInputStream}}, which only adds one method to count the number of consumed bytes. This complicates the code and makes plugging in other implementations of InputStream more cumbersome. I could get rid of CompressorInputStream entirely with a few minor changes to the code, but obviously this would be backward incompatible (see [2]).

References:
[1] GitHub fork, {{bzip2}} branch: https://github.com/dweiss/commons-compress/tree/bzip2
[2] Removal and cleanup of CompressorInputStream: https://github.com/dweiss/commons-compress/commit/6948ed371e8ed6e6b69b96ee936d1455cbfd6458

  was:
This is related to COMPRESS-291.
In short: decompressing 7z archives was an order of magnitude slower in Java than with native tooling. My investigation showed that the problematic archive used bz2 streams inside. I then did a quick hack-experiment which took bz2 decompressor from the Apache Hadoop project (the Java version, not the native one) and replaced the default one used for bz2 stream decompression of the 7z archiver in commons. I then ran a quick benchmark on this file:

{code}
https://archive.org/download/stackexchange/english.stackexchange.com.7z
{code}

The decompression speeds are (SSD, the file was essentially fully cached in memory, so everything is CPU bound):

{code}
native {{7za}}:                  13 seconds
Commons (original):             222 seconds
Commons (patched w/Hadoop bz2):  30 seconds
{code}

Yes, it's still 3 times slower than native code, but it's no longer glacially slow... My patch is a quick and dirty proof of concept (not committable, see [1]), but it passes the tests. Some notes:

- Hadoop's stream isn't suited for handling concatenated bz2 streams, it'd have to be either patched in the code or (better) decorated at a level above the low-level decoder,
- I only substituted the decompressor in 7z, but obviously this could benefit in other places (zip, etc.); essentially, I'd remove BZip2CompressorInputStream entirely.
- while I toyed around with the above idea I noticed a really annoying thing -- all streams are required to extend {{CompressorInputStream}}, which only adds one method to count the number of consumed bytes. This complicates the code and makes plugging in other implementations of InputStreams more cumbersome. I could get rid of CompressorInputStream entirely with a few minor changes to the code, but obviously this would be backward incompatible (see [2]).
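(For illustration of the note about {{CompressorInputStream}} only existing to count consumed bytes: the counting could instead live in a plain decorator, so decompressors would not need a common base class. This is a hypothetical sketch, not code from the patch or from Commons Compress.)

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical decorator: counts the bytes consumed from the wrapped stream,
// so any plain InputStream implementation can be plugged in unchanged.
// Note: skip() is inherited from FilterInputStream and would bypass the count;
// a full implementation would override it too.
class CountingInputStream extends FilterInputStream {
    private long bytesRead;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    long getBytesRead() {
        return bytesRead;
    }
}
```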
References:
[1] GitHub fork, {{bzip2}} branch: https://github.com/dweiss/commons-compress/tree/bzip2
[2] Removal and cleanup of CompressorInputStream: https://github.com/dweiss/commons-compress/commit/6948ed371e8ed6e6b69b96ee936d1455cbfd6458


> bz2 stream decompressor is 10x slower than it could be
> ------------------------------------------------------
>
>                 Key: COMPRESS-333
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-333
>             Project: Commons Compress
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>
> This is related to COMPRESS-291. In short: decompressing 7z archives was an order of magnitude slower in Java than with native tooling.
> My investigation showed that the problematic archive used bz2 streams inside. I then did a quick hack-experiment which took bz2 decompressor from the Apache Hadoop project (the Java version, not the native one) and replaced the default one used for bz2 stream decompression of the 7z archiver in commons.
> I then ran a quick benchmark on this file:
> {code}
> https://archive.org/download/stackexchange/english.stackexchange.com.7z
> {code}
> The decompression speeds are (SSD, the file was essentially fully cached in memory, so everything is CPU bound):
> {code}
> native {{7za}}:                           13 seconds
> Commons (original):                      222 seconds
> Commons (patched w/Hadoop bz2):           30 seconds
> Commons (patched w/BufferedInputStream):  28 seconds
> {code}
> Yes, it's still 3 times slower than native code, but it's no longer glacially slow... My patch is a quick and dirty proof of concept (not committable, see [1]), but it passes the tests. Some notes:
> - Hadoop's stream isn't suited for handling concatenated bz2 streams, it'd have to be either patched in the code or (better) decorated at a level above the low-level decoder,
> - I only substituted the decompressor in 7z, but obviously this could benefit in other places (zip, etc.); essentially, I'd remove BZip2CompressorInputStream entirely.
> - while I toyed around with the above idea I noticed a really annoying thing -- all streams are required to extend {{CompressorInputStream}}, which only adds one method to count the number of consumed bytes. This complicates the code and makes plugging in other implementations of InputStreams more cumbersome. I could get rid of CompressorInputStream entirely with a few minor changes to the code, but obviously this would be backward incompatible (see [2]).
>
> References:
> [1] GitHub fork, {{bzip2}} branch: https://github.com/dweiss/commons-compress/tree/bzip2
> [2] Removal and cleanup of CompressorInputStream: https://github.com/dweiss/commons-compress/commit/6948ed371e8ed6e6b69b96ee936d1455cbfd6458



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
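(For illustration of the note about decorating the low-level decoder to handle concatenated bz2 streams: the idea of chaining members "at a level above" the decoder could look roughly like this. MemberFactory and the per-member stream it opens are hypothetical stand-ins, not the actual Hadoop or Commons Compress API, and the sketch assumes each member decoder consumes exactly its own bytes from the underlying stream.)

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical decorator: when the current member's decoder reaches EOF,
// peek at the raw stream; if more bytes remain, open a decoder for the next
// member and continue reading, so concatenated members appear as one stream.
class ConcatenatedStream extends InputStream {
    interface MemberFactory {
        InputStream open(InputStream raw) throws IOException;
    }

    private final PushbackInputStream raw;
    private final MemberFactory factory;
    private InputStream current;

    ConcatenatedStream(InputStream raw, MemberFactory factory) throws IOException {
        this.raw = new PushbackInputStream(raw, 1);
        this.factory = factory;
        this.current = factory.open(this.raw);
    }

    @Override
    public int read() throws IOException {
        int b = current.read();
        while (b < 0) {                 // current member exhausted
            int peek = raw.read();
            if (peek < 0) {
                return -1;              // raw input exhausted: real EOF
            }
            raw.unread(peek);           // more data follows: open next member
            current = factory.open(raw);
            b = current.read();
        }
        return b;
    }
}
```

(A real version would also override read(byte[], int, int) for performance; the single-byte read keeps the sketch short.)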