hadoop-common-issues mailing list archives

From "Niels Basjes (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-7076) Splittable Gzip
Date Wed, 18 May 2011 12:41:47 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Basjes updated HADOOP-7076:

    Attachment: HADOOP-7076-2011-05-18.patch

This patch has no code changes compared to the previous one. Only the Javadoc has been improved
to provide additional suggestions for optimal usage of this patch.

> Splittable Gzip
> ---------------
>                 Key: HADOOP-7076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7076
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Niels Basjes
>            Assignee: Niels Basjes
>         Attachments: HADOOP-7076-2011-01-26.patch, HADOOP-7076-2011-01-29.patch, HADOOP-7076-2011-02-05.patch,
>                      HADOOP-7076-2011-02-06.patch, HADOOP-7076-2011-05-18.patch, HADOOP-7076.patch
> Files compressed with the gzip codec are not splittable due to the nature of the codec.
> This limits the options you have for scaling out when reading large gzipped input files.
> Given the fact that gunzipping a 1GiB file usually takes only 2 minutes I figured that
> for some use cases wasting some resources may result in a shorter job time under certain conditions.
> So reading the entire input file from the start for each split (wasting resources!!)
> may lead to additional scalability.
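
The idea in the description above can be illustrated with a minimal sketch. This is not the code from the attached patch (which targets Hadoop's Java codec API). It is a standalone Python illustration under two assumptions made here for simplicity: split boundaries are expressed as offsets in the uncompressed stream, and a line belongs to the split in which its first byte falls. The function name read_split is hypothetical.

```python
import gzip


def read_split(path, start, end):
    """Yield the lines whose first byte lies in [start, end).

    Offsets refer to the uncompressed stream (an assumption of this
    sketch). Every call decompresses the file from the beginning and
    throws away everything before `start`; that is the deliberate
    waste the issue description refers to.
    """
    offset = 0  # uncompressed offset of the first byte of the current line
    with gzip.open(path, "rb") as stream:
        for line in stream:
            if offset >= end:
                break  # everything from here on belongs to a later split
            if offset >= start:
                yield line  # line starts inside this split
            offset += len(line)
```

Each split repeats the decompression work for all data that precedes it, so the last split decompresses nearly the whole file. The possible gain is that the per-record processing after decompression is spread over several tasks, which only pays off when that processing costs clearly more than the decompression itself.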

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
