hadoop-common-issues mailing list archives

From "Mike Percy (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-8522) ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used
Date Thu, 21 Jun 2012 08:47:42 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Percy updated HADOOP-8522:
-------------------------------

    Attachment: HADOOP-8522-2.patch

I am attaching a patch that makes the non-native resetState() behave like the native
resetState(), so that both produce output that complies with RFC 1952 and is readable by "gunzip".

The implementation is lifted directly from HBase:
https://svn.apache.org/viewvc/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/ReusableStreamGzipCodec.java?revision=1342856&view=markup

I added one unit test which simply checks that the output is readable with GZIPInputStream,
and a second in which I had to comment out the assert() because the JDK GZIPInputStream cannot
handle multi-member gzip files. I'm open to suggestions for improving the unit tests. It looks
like HBase actually stores the expected bytes and requires an exact match in their test.

Testing done: I manually inspected the data generated by the 2nd unit test and confirmed that
it contains the headers, trailers, CRC32 checksums, and lengths corresponding to the two members
included. I also verified that the output of unit test 2 is readable with "gunzip" and that the
decompressed output matches the provided input.
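
The manual header/trailer inspection described above could probably be automated along these
lines. This is only a sketch, not part of the attached patch: the class name GzipTrailerCheck
and its helper methods are hypothetical, and it uses the plain JDK GZIPOutputStream rather than
ResetableGzipOutputStream. It relies on the RFC 1952 layout, where each member starts with the
magic bytes 0x1f 0x8b and ends with an 8-byte little-endian trailer (CRC32, then ISIZE).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not taken from the patch.
public class GzipTrailerCheck {

    /** Compresses data into one complete gzip member (header, deflate data, trailer). */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.toByteArray();
    }

    /** True if the member starts with the gzip magic bytes 0x1f 0x8b. */
    static boolean hasGzipMagic(byte[] member) {
        return member.length >= 2
                && (member[0] & 0xff) == 0x1f
                && (member[1] & 0xff) == 0x8b;
    }

    /** True if the last 8 bytes hold the CRC32 and ISIZE of the uncompressed data. */
    static boolean trailerMatches(byte[] member, byte[] data) {
        ByteBuffer trailer = ByteBuffer.wrap(member, member.length - 8, 8)
                .order(ByteOrder.LITTLE_ENDIAN);
        long crc = trailer.getInt() & 0xffffffffL;    // CRC32 of the uncompressed data
        long isize = trailer.getInt() & 0xffffffffL;  // uncompressed length mod 2^32
        CRC32 expected = new CRC32();
        expected.update(data);
        return crc == expected.getValue() && isize == (data.length & 0xffffffffL);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello gzip".getBytes(StandardCharsets.UTF_8);
        byte[] member = gzip(data);
        System.out.println("magic ok:   " + hasGzipMagic(member));
        System.out.println("trailer ok: " + trailerMatches(member, data));
    }
}
```

For a two-member file, the same checks would need the byte offset at which the second member
begins, which this sketch does not compute.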
                
> ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-8522
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8522
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 1.0.3, 2.0.0-alpha
>            Reporter: Mike Percy
>         Attachments: HADOOP-8522-2.patch
>
>
> ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are
> used. The issue is that finish() flushes the compressor buffer and writes the gzip CRC32 +
> data length trailer. After that, resetState() does not repeat the gzip header, but simply
> starts writing more deflate-compressed data. The resultant files are not readable by the Linux
> "gunzip" tool. ResetableGzipOutputStream should write valid multi-member gzip files.
> The gzip format is specified in [RFC 1952|https://tools.ietf.org/html/rfc1952].
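
As an illustration of the multi-member layout the issue asks for, here is a minimal sketch that
builds a two-member gzip stream by concatenating two complete members, each with its own header
and trailer. It is not the patch: the class MultiMemberGzip and its methods are hypothetical, and
it uses the plain JDK GZIPOutputStream. Reading the result back in one pass relies on
GZIPInputStream support for concatenated members, which newer JDKs provide but the JDK referred
to in the comment above apparently did not.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration, not taken from the patch.
public class MultiMemberGzip {

    /** Writes text as one complete gzip member: header, deflate data, trailer. */
    static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    /** Concatenates two members; the second header follows the first trailer directly. */
    static byte[] concat(byte[] first, byte[] second) {
        byte[] out = new byte[first.length + second.length];
        System.arraycopy(first, 0, out, 0, first.length);
        System.arraycopy(second, 0, out, first.length, second.length);
        return out;
    }

    /** Decompresses every member in data (assumes a JDK that reads concatenated members). */
    static String gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] first = gzipMember("hello ");
        byte[] both = concat(first, gzipMember("world"));
        // The second member must begin with its own header (magic bytes 0x1f 0x8b).
        System.out.printf("second member starts with %02x %02x%n",
                both[first.length] & 0xff, both[first.length + 1] & 0xff);
        System.out.println(gunzip(both));
    }
}
```

The invalid files described in the issue differ from this layout in that the data written after
resetState() has no header of its own, so a reader has nothing to mark the start of a new member.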


        
