hadoop-mapreduce-issues mailing list archives

From "David Ciemiewicz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-469) Support concatenated gzip and bzip2 files
Date Tue, 06 Apr 2010 21:43:33 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854201#action_12854201 ]

David Ciemiewicz commented on MAPREDUCE-469:
--------------------------------------------

The bzip2 compression format also supports concatenating individually compressed bzip2 files
into a single file.

bzcat reads all of the data in such a concatenated file without any problem.

Unfortunately, both Hadoop Streaming and Pig see only about 2% of the data from the original
file in my case. That is effectively a 98% data loss.
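
The difference between the two behaviors described above can be reproduced outside Hadoop. The following is a minimal sketch using Python's standard bz2 module, not Hadoop's codec; the sample payloads are made up for illustration, and the assumption that the truncated reads come from a decoder stopping at the first end-of-stream marker is only a plausible explanation, not something confirmed in this report.

```python
import bz2

# Two independently compressed bzip2 streams, concatenated byte for byte.
# This is the same result as `cat a.bz2 b.bz2 > both.bz2` on the command line.
# The payloads are hypothetical sample data.
part_a = bz2.compress(b"first part\n")
part_b = bz2.compress(b"second part\n")
concatenated = part_a + part_b

# Multi-stream-aware decoding: bz2.decompress() keeps going after the first
# end-of-stream marker and returns the data from every stream, like bzcat.
all_data = bz2.decompress(concatenated)

# Single-stream decoding: a BZ2Decompressor object handles exactly one stream.
# It stops at the first end-of-stream marker and leaves the remaining
# compressed bytes untouched in unused_data.
single = bz2.BZ2Decompressor()
first_only = single.decompress(concatenated)
leftover = single.unused_data

print(len(all_data), len(first_only), len(leftover))
```

A reader built like the second variant silently drops everything after the first stream unless it checks for leftover input and starts a new decompressor on it.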



> Support concatenated gzip and bzip2 files
> -----------------------------------------
>
>                 Key: MAPREDUCE-469
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-469
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Tom White
>            Assignee: Ravi Gummadi
>
> When running MapReduce with concatenated gzip files as input only the first part is read,
> which is confusing, to say the least. Concatenated gzip is described in http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
> and in http://www.ietf.org/rfc/rfc1952.txt. (See original report at http://www.nabble.com/Problem-with-Hadoop-and-concatenated-gzip-files-to21383097.html)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

