hadoop-common-issues mailing list archives

From "Luke Lu (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7076) Splittable Gzip
Date Fri, 09 Dec 2011 07:44:47 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165911#comment-13165911 ]

Luke Lu commented on HADOOP-7076:

The code, especially the docs/comments, looks good to me. I wonder whether the codec would
be better named SkipGzipCodec instead of SplittableGzipCodec, mostly because the latter
would be a good name for a real splittable variant of the gzip format, and the former sounds
weird enough to prompt users to read the documentation and find out that the codec actually
does O(s*n) I/O. That makes it mostly suitable for infrequently processing archived gzipped
files with a number of splits less than the compression factor (uncompressed size/compressed
size), which is not a precise criterion BTW. Otherwise, you'd be better off converting these
files into a real splittable compressed format.
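
To make the O(s*n) remark concrete, here is a rough cost sketch. It takes s to be the number of splits and n the compressed file size, which is one reading of the comment and not something the comment spells out. The function names are hypothetical and not part of the patch. The sketch models two cases: every split task reading the whole file, and every split task stopping once it reaches the end of its own split (an assumption about the behaviour, not a confirmed detail of the codec).

```python
def total_compressed_io(compressed_size: int, num_splits: int,
                        stop_at_split_end: bool = True) -> int:
    """Compressed bytes read across all split tasks (illustrative model only)."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if not stop_at_split_end:
        # Every task reads the whole file: exactly s * n bytes.
        return num_splits * compressed_size
    # Task i (1-based) reads from offset 0 up to the end of its own split,
    # i.e. roughly i/s of the file. The sum is about n * (s + 1) / 2,
    # which is still O(s * n).
    return sum(compressed_size * i // num_splits
               for i in range(1, num_splits + 1))


def within_rule_of_thumb(uncompressed_size: int, compressed_size: int,
                         num_splits: int) -> bool:
    """The comment's rough criterion: splits < compression factor."""
    return num_splits < uncompressed_size / compressed_size
```

For example, under this model a 1000-byte compressed file read with 4 splits costs 4000 bytes of input in the first case and 2500 in the second, compared with 1000 bytes for a single unsplit read.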
> Splittable Gzip
> ---------------
>                 Key: HADOOP-7076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7076
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Niels Basjes
>            Assignee: Niels Basjes
>         Attachments: HADOOP-7076-2011-01-26.patch, HADOOP-7076-2011-01-29.patch, HADOOP-7076-2011-02-05.patch,
> HADOOP-7076-2011-02-06.patch, HADOOP-7076-2011-05-18.patch, HADOOP-7076-2011-08-05-2255.patch,
> HADOOP-7076-2011-08-05-2315.patch, HADOOP-7076-2011-12-04-2332.patch, HADOOP-7076-branch-0.22.patch,
> Files compressed with the gzip codec are not splittable due to the nature of the codec.
> This limits the options you have for scaling out when reading large gzipped input files.
> Given that gunzipping a 1GiB file usually takes only 2 minutes, I figured that for some
> use cases wasting some resources may result in a shorter job time under certain conditions.
> So reading the entire input file from the start for each split (wasting resources!!)
> may lead to additional scalability.
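
The read-from-the-start idea in the description can be sketched outside Hadoop with Python's standard gzip module. This is only an illustration, not the patch's implementation. The function name read_split is hypothetical, and for simplicity the split boundaries are given as offsets into the uncompressed stream, which is an assumption and may not match how the codec defines its splits.

```python
import gzip
import io


def read_split(gz_bytes: bytes, start: int, end: int,
               chunk_size: int = 64 * 1024) -> bytes:
    """Return uncompressed bytes [start, end), decompressing from offset 0."""
    out = bytearray()
    pos = 0  # current offset in the uncompressed stream
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as stream:
        while pos < end:
            chunk = stream.read(min(chunk_size, end - pos))
            if not chunk:
                break  # reached end of data before the split end
            chunk_start = pos
            pos += len(chunk)
            if pos <= start:
                continue  # entirely before the split: decompressed, then discarded
            # Keep only the part of the chunk at or after the split start.
            out += chunk[max(0, start - chunk_start):]
    return bytes(out)
```

Every call decompresses everything before `start` and throws it away, which is the wasted work the description refers to. Calls for different splits are independent of each other, so they can run in parallel.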

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

