commons-issues mailing list archives

From "Jeremy Gustie (JIRA)" <>
Subject [jira] [Created] (COMPRESS-376) decompressConcatenated improvement
Date Thu, 08 Dec 2016 15:03:59 GMT
Jeremy Gustie created COMPRESS-376:

             Summary: decompressConcatenated improvement
                 Key: COMPRESS-376
             Project: Commons Compress
          Issue Type: Improvement
          Components: Compressors
            Reporter: Jeremy Gustie

First, the problem I am seeing: I generally set {{decompressConcatenated}} to {{true}}, and
most of the time this works fine. However, it seems that some versions of Python tarfile
pad a compressed TAR file with null bytes. The null bytes are treated as garbage, causing
decompression to fail. Unfortunately, this failure occurs while filling a buffer with the
data used to read the final entry in the TAR file, so {{TarArchiveInputStream.getNextEntry}}
fails before the last entry can be returned.

There are a couple of potential solutions I can see:

1. The easiest thing to do would be to special-case the null padding and just terminate without
failing (in the {{GzipCompressorInputStream.init}} method, this amounts to adding a check
for {{magic0 == 0 && (magic1 == 0 || magic1 == -1)}} and returning {{false}}). Perhaps
draining the underlying stream to confirm that the remaining bytes are all null could reduce
the likelihood of falsely recognizing other garbage as padding.

2. Change {{decompressConcatenated}} to a tri-state value (maybe add an extra {{ignoreGarbage}}
flag) to suppress the failure; basically concatenated streams would be decompressed only if
the appropriate magic is found. This has API impact but completely preserves backwards compatibility.

3. Finally, deferring the failure to the next read attempt may also be a viable solution that
nearly preserves backwards compatibility. As I mentioned before, the "Garbage after..." error
occurs while reading the final entry in a TAR file: if the current read (which contains all
of the final data from the compression stream) were allowed to complete normally, the downstream
consumer might also complete normally; the next attempt to read (the garbage past the end
of the compression stream) would be the read that fails with the "Garbage after..." error.
This gives the downstream code the best opportunity to both process the full compression stream
and receive the unexpected garbage failure.
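
To make option 1 concrete, here is a minimal standalone sketch of the two checks it describes:
the magic-byte test and the optional drain that confirms the rest of the stream is null. The
class and method names ({{NullPaddingCheck}}, {{looksLikeNullPadding}}, {{remainderIsAllNull}})
are hypothetical and are not part of Commons Compress; how such checks would be wired into
{{GzipCompressorInputStream.init}} is left open.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Commons Compress.
class NullPaddingCheck {

    // Mirrors the check proposed in option 1: the first byte where a gzip
    // magic number was expected is 0, and the second is 0 or end of stream (-1).
    static boolean looksLikeNullPadding(int magic0, int magic1) {
        return magic0 == 0 && (magic1 == 0 || magic1 == -1);
    }

    // Drains the stream and reports whether every remaining byte is 0.
    // An already exhausted stream counts as "all null".
    static boolean remainderIsAllNull(InputStream in) throws IOException {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buf[i] != 0) {
                    return false; // real garbage, not padding
                }
            }
        }
        return true;
    }
}
```

Under this sketch, the stream would be treated as cleanly terminated only when both checks
pass; anything else would still raise the existing garbage error.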

I was mostly looking at {{GzipCompressorInputStream}}; I suspect similar changes would
be needed in the other compressor streams that support {{decompressConcatenated}}.
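
Since the deferred failure in option 3 is not specific to gzip, it can be illustrated with a
generic wrapper. This is only a sketch of the idea, assuming the wrapped stream throws an
{{IOException}} when it reaches the garbage; {{DeferredFailureInputStream}} is a hypothetical
name, and a real change would more likely live inside each compressor stream.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper, not part of Commons Compress.
class DeferredFailureInputStream extends FilterInputStream {

    // Failure seen after some bytes were already delivered in the same call.
    private IOException pending;

    DeferredFailureInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : one[0] & 0xff;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (pending != null) {
            throw pending; // the deferred failure surfaces on the next read
        }
        if (len == 0) {
            return 0;
        }
        int total = 0;
        while (total < len) {
            int n;
            try {
                n = in.read(b, off + total, len - total);
            } catch (IOException e) {
                if (total == 0) {
                    throw e; // nothing to deliver, so fail immediately
                }
                pending = e; // deliver what we have, fail on the next call
                break;
            }
            if (n == -1) {
                break; // clean end of stream
            }
            total += n;
        }
        return total > 0 ? total : -1;
    }
}
```

With this behavior, the read that carries the final bytes of the compressed data completes
normally, and only the following read reports the error.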
