hadoop-common-issues mailing list archives

From "Niels Basjes (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-7076) Splittable Gzip
Date Sat, 29 Jan 2011 23:26:46 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988520#action_12988520 ]

Niels Basjes commented on HADOOP-7076:
--------------------------------------

To clarify the bugs I fixed in the existing compression tests:
1) The generated test file starts with a line number. In the original version this line number
is written in binary, but the file is then read as ASCII records separated by line endings.
I'm surprised the test actually worked in its original form. I changed this to ASCII all the
way through.
2) The decompressor is reused, but a decompressor must be reset before it can be reused.
The test now resets it first.
> Splittable Gzip
> ---------------
>
>                 Key: HADOOP-7076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7076
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Niels Basjes
>            Assignee: Niels Basjes
>         Attachments: HADOOP-7076-2011-01-26.patch, HADOOP-7076-2011-01-29.patch, HADOOP-7076.patch
>
>
> Files compressed with the gzip codec are not splittable due to the nature of the codec.
> This limits your options for scaling out when reading large gzipped input files.
> Given that gunzipping a 1 GiB file usually takes only 2 minutes, I figured that for some
> use cases wasting some resources may result in a shorter job time.
> So reading the entire input file from the start for each split (wasting resources!!)
> may lead to additional scalability.
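
The idea in the description can be sketched as follows. This is a simplified illustration
and not the patch itself: every split decompresses the file from the start, throws away
everything before its own range, and keeps only the lines that begin inside that range. For
simplicity the split boundaries here are offsets into the uncompressed data, which is an
assumption; the class and method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; not taken from the HADOOP-7076 patch.
class SplitGzipSketch {

    // Helper to build gzipped test input in memory.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns the lines whose first byte lies in [start, end) of the uncompressed data.
    // Decompression always begins at byte 0; earlier data is read and discarded.
    static List<String> readSplit(byte[] gzipped, long start, long end) {
        List<String> lines = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)))) {
            long offset = 0;     // uncompressed bytes consumed so far
            long lineStart = 0;  // uncompressed offset where the current line began
            StringBuilder line = new StringBuilder();
            int b;
            while ((b = in.read()) != -1) {
                offset++;
                if (b == '\n') {
                    if (lineStart >= start && lineStart < end) {
                        lines.add(line.toString());
                    }
                    line.setLength(0);
                    lineStart = offset;
                    if (lineStart >= end) {
                        break; // the next line belongs to a later split
                    }
                } else if (lineStart >= start) {
                    line.append((char) b); // only buffer lines this split owns
                }
            }
            // A final line without a trailing newline still counts.
            if (line.length() > 0 && lineStart >= start && lineStart < end) {
                lines.add(line.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Because a line is assigned to the split that contains its first byte, adjacent splits never
emit the same line twice, and the work of decompressing the prefix is the wasted resource
the description refers to.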

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

