hadoop-mapreduce-issues mailing list archives

From "David Ciemiewicz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-469) Support concatenated gzip and bzip2 files
Date Wed, 07 Apr 2010 00:16:33 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854268#action_12854268 ]

David Ciemiewicz commented on MAPREDUCE-469:
--------------------------------------------

Unfortunately, I did not discover that concatenated bzip2 files do not work in Map-Reduce
until *AFTER* I had concatenated 3TB of data from over 250K compressed files.

A colleague suggested that I "fix" my data using the following approach:

hadoop dfs -cat X | bunzip2 | bzip2 | hadoop dfs -put - X.new

I tried this on a single 3GB file made by concatenating multiple bzip2-compressed files.

The process took just over an hour, with compression taking 5-6X longer than decompression
(as measured in CPU utilization).

By contrast, it took only several minutes to concatenate the part files into a single file
in the first place.


I think this shows that decompressing and recompressing data is not really a viable
way to create large concatenations of smaller files.
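
To make the cost concrete, here is a rough Python sketch of what the "bunzip2 | bzip2"
step has to do: decode every byte of the concatenated input and re-encode it as one
stream. The helper name recompress_single_stream and the chunk size are my own
illustrative choices, and the sketch works on local file objects, not on HDFS.

```python
import bz2
import io


def recompress_single_stream(src, dst, chunk_size=1 << 20):
    """Rewrite a (possibly concatenated) bzip2 file object as one bzip2 stream.

    Hypothetical helper for illustration; src and dst are binary file objects.
    """
    compressor = bz2.BZ2Compressor()
    # BZ2File in read mode decodes all concatenated streams (Python 3.3+).
    with bz2.BZ2File(src, "rb") as reader:
        while True:
            chunk = reader.read(chunk_size)
            if not chunk:
                break
            # Every decoded byte is compressed again; this is the expensive part.
            dst.write(compressor.compress(chunk))
    dst.write(compressor.flush())


# Two independently compressed parts, concatenated byte-for-byte.
parts = bz2.compress(b"part one\n") + bz2.compress(b"part two\n")
out = io.BytesIO()
recompress_single_stream(io.BytesIO(parts), out)
```

The output holds the same data as one stream, but only after a full decompress and
recompress pass over the input.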

The best performing solution is to create the smaller part files in parallel with a bunch
of reducers, then concatenate them later into one (or several) larger files.

So fixing Hadoop Map Reduce to read concatenated files is probably the
highest-return investment the community could make here.
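
As a sketch of what reading a concatenation involves (in Python, not Hadoop's actual
codec code): a reader that stops at the first end-of-stream marker sees only the first
part, while a reader that restarts decoding on the leftover bytes sees everything. The
function names below are hypothetical and the data is a toy in-memory example.

```python
import bz2


def read_first_stream_only(data):
    """Decode only the first bzip2 stream, ignoring anything after it."""
    return bz2.BZ2Decompressor().decompress(data)


def read_all_streams(data):
    """Decode every concatenated bzip2 stream in data."""
    out = []
    while data:
        decomp = bz2.BZ2Decompressor()
        out.append(decomp.decompress(data))
        if not decomp.eof:
            raise EOFError("truncated bzip2 stream")
        # Bytes after the end-of-stream marker belong to the next stream.
        data = decomp.unused_data
    return b"".join(out)


# Two independently compressed parts, concatenated byte-for-byte.
concatenated = bz2.compress(b"part one\n") + bz2.compress(b"part two\n")
first_only = read_first_stream_only(concatenated)
everything = read_all_streams(concatenated)
```

Here first_only holds just the first part, while everything holds both parts, with no
recompression needed.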




> Support concatenated gzip and bzip2 files
> -----------------------------------------
>
>                 Key: MAPREDUCE-469
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-469
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Tom White
>            Assignee: Ravi Gummadi
>
> When running MapReduce with concatenated gzip files as input only the first part is read,
> which is confusing, to say the least. Concatenated gzip is described in http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
> and in http://www.ietf.org/rfc/rfc1952.txt. (See original report at http://www.nabble.com/Problem-with-Hadoop-and-concatenated-gzip-files-to21383097.html)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

