hadoop-mapreduce-user mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Making gzip splittable for Hadoop
Date Fri, 30 Mar 2012 14:07:27 GMT

In many Hadoop production environments you get gzipped files as the raw
input. Usually these are Apache HTTPD logfiles. When putting these gzipped
files into Hadoop you are stuck with exactly one map task per input file.
In many scenarios this is fine. However, when a lot of work is done in this
very first map task, it may well be advantageous to divide the work over
multiple tasks, even if there is a penalty for this scaling out.

I've created an add-on for Hadoop that makes this possible.

I've reworked the patch I initially created to be included in Hadoop (see
It can now be used by simply adding a jar file to the classpath of an
existing Hadoop installation.
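
To give an idea of what that looks like in practice: registering a custom
compression codec in Hadoop is normally done via the io.compression.codecs
configuration property. The sketch below is only an assumption on my part
about how this particular codec would be wired up; the codec class name
(SplittableGzipCodec) and the exact split-size property to use should be
checked against the README on the GitHub page.

    <!-- Hedged sketch for core-site.xml / mapred-site.xml.
         The codec class name below is assumed; verify it in the project
         README before use. -->
    <property>
      <name>io.compression.codecs</name>
      <value>nl.basjes.hadoop.io.compress.SplittableGzipCodec</value>
    </property>
    <property>
      <!-- A splittable codec only helps if the framework is allowed to
           create more than one split per file; cap the split size so
           large .gz files are divided over several map tasks. -->
      <name>mapreduce.input.fileinputformat.split.maxsize</name>
      <value>134217728</value> <!-- 128 MB, an example value -->
    </property>

With a configuration along these lines, each input split beyond the first
pays the decompression penalty of reading up to its start offset, which is
the scaling-out cost mentioned above.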

I put the code on github ( https://github.com/nielsbasjes/splittablegzip )
and (for now) the description on my homepage:

This feature only works with Hadoop 0.21 and up (I tested it with Cloudera
So for now Hadoop 1.x is not yet supported (waiting for HADOOP-7823).

Running "mvn package" automatically generates an RPM on my CentOS system.

Have fun with it and let me know what you think.

Best regards / Met vriendelijke groeten,

Niels Basjes
