Date: Tue, 20 Dec 2016 02:57:58 +0000 (UTC)
From: "Jeremy Gustie (JIRA)"
To: issues@commons.apache.org
Subject: [jira] [Commented] (COMPRESS-376) decompressConcatenated improvement

    [ https://issues.apache.org/jira/browse/COMPRESS-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763045#comment-15763045 ]

Jeremy Gustie commented on COMPRESS-376:
----------------------------------------

Agree, definitely more sophisticated... I have been trying to isolate what made the archive we found fail; my best guess is that the last entry needs to cross a boundary at {{length - 512}} bytes into the file. I was able to produce an 8K file whose final entry I can only read if I do not buffer between the compressor and archiving streams, by doing this:

{code}
# Create two small files of random bytes (the source of the random data is assumed here)
head -c 7680 /dev/urandom > test1.rnd
head -c 10 /dev/urandom > test2.rnd
# Compress both files, then null-pad the archive out to exactly 8192 bytes
tar cz test1.rnd test2.rnd | cat - /dev/zero | head -c 8192 > COMPRESS-376.tar.gz
{code}

I haven't had much of a chance to figure out exactly why this is, but hopefully it at least gives you an archive that can fail that test.
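For reference, the failing read is just the usual stream wiring, with no buffering between the compressor and the archiver. This is a minimal sketch: {{GzipCompressorInputStream}}, {{TarArchiveInputStream}}, and {{getNextTarEntry}} are the real Commons Compress names, the file name matches the command above, and everything else is illustrative:

{code}
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class Compress376Repro {
    public static void main(String[] args) throws IOException {
        try (InputStream fin = new FileInputStream("COMPRESS-376.tar.gz");
             // decompressConcatenated = true: keep reading past the first .gz member
             GzipCompressorInputStream gz = new GzipCompressorInputStream(fin, true);
             // Note: no BufferedInputStream between the compressor and the archiver
             TarArchiveInputStream tar = new TarArchiveInputStream(gz)) {
            TarArchiveEntry entry;
            // The null padding after the .gz stream is treated as garbage, so this
            // loop throws an IOException before test2.rnd is ever returned.
            while ((entry = tar.getNextTarEntry()) != null) {
                System.out.println(entry.getName() + " " + entry.getSize());
            }
        }
    }
}
{code}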
> decompressConcatenated improvement
> ----------------------------------
>
>                 Key: COMPRESS-376
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-376
>             Project: Commons Compress
>          Issue Type: Improvement
>          Components: Compressors
>            Reporter: Jeremy Gustie
>
> First the problem I am seeing: in general I always set {{decompressConcatenated}} to {{true}}, and most of the time this works fine. However, it seems that some versions of Python's tarfile pad a compressed TAR file with null bytes. The null bytes are recognized as garbage, causing decompression to fail. Unfortunately, this failure occurs while filling a buffer with the data needed to read the final entry in the TAR file, so {{TarArchiveInputStream.getNextEntry}} fails before the last entry can be returned.
> There are a couple of potential solutions I can see:
> 1. The easiest thing to do would be to special-case the null padding and just terminate without failing (in the {{GzipCompressorInputStream.init}} method, this amounts to adding a check for {{magic0 == 0 && (magic1 == 0 || magic1 == -1)}} and returning {{false}}; see the sketch at the end of this message). Perhaps draining the underlying stream to ensure that the remaining bytes are all null could reduce the likelihood of a false positive when recognizing the padding.
> 2. Change {{decompressConcatenated}} to a tri-state value (or add an extra {{ignoreGarbage}} flag) to suppress the failure; basically, concatenated streams would be decompressed only when the appropriate magic is found. This has API impact but completely preserves backwards compatibility.
> 3. Finally, deferring the failure to the next read attempt may also be a viable solution that nearly preserves backwards compatibility. As I mentioned before, the "Garbage after..." error occurs while reading the final entry in a TAR file: if the current read (which contains all of the final data from the compression stream) were allowed to complete normally, the downstream consumer might also complete normally; the next attempt to read (the garbage past the end of the compression stream) would then be the read that fails with the "Garbage after..." error. This gives the downstream code the best opportunity to both process the full compression stream and still receive the unexpected-garbage failure.
> I was mostly looking at {{GzipCompressorInputStream}}; I suspect similar changes would be needed in the other decompress-concatenated compressor streams.
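To make option 1 concrete, here is roughly where the proposed check would sit. This is a simplified stand-in for {{GzipCompressorInputStream.init}}, not a patch; the real method parses the full gzip member header, and only the magic values and the check itself come from the description above:

{code}
import java.io.IOException;
import java.io.InputStream;

// Simplified stand-in for GzipCompressorInputStream.init, only to show where
// the null-padding check from option 1 would fit; not the real implementation.
class NullPaddingSketch {
    private final InputStream in;

    NullPaddingSketch(InputStream in) {
        this.in = in;
    }

    boolean init(boolean isFirstMember) throws IOException {
        int magic0 = in.read();
        int magic1 = in.read();
        // Normal end of input after at least one complete .gz member.
        if (magic0 == -1 && !isFirstMember) {
            return false;
        }
        // Proposed special case: treat trailing null padding (as written by
        // some versions of Python's tarfile) as end of input, not garbage.
        if (!isFirstMember && magic0 == 0 && (magic1 == 0 || magic1 == -1)) {
            return false;
        }
        // Anything else that is not the gzip magic (0x1f 0x8b) is an error.
        if (magic0 != 31 || magic1 != 139) {
            throw new IOException(isFirstMember
                    ? "Input is not in the .gz format"
                    : "Garbage after a valid .gz stream");
        }
        // ... the rest of the member header parsing would continue here ...
        return true;
    }
}
{code}

Draining the underlying stream to confirm the remaining bytes are all null, as suggested in option 1, would slot in right before returning {{false}} in the padding case.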