hadoop-mapreduce-user mailing list archives

From Joey Echeverria <j...@cloudera.com>
Subject Re: Improve data locality for MR job processing tar.gz files
Date Mon, 09 May 2011 18:25:17 GMT
You could write your own input format class to handle breaking out the
tar files for you. If you subclass FileInputFormat, Hadoop will handle
decompressing the files because of the .gz file extension. Your input
format would just need to use a Java tar file library (e.g.
http://code.google.com/p/jtar/) to give your mappers access to the
files underneath.
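To illustrate the tar-parsing side of this, here is a minimal, self-contained sketch of what such a record reader would hand to the mappers: gunzip the stream with the JDK's GZIPInputStream, then walk the 512-byte tar headers by hand. A real implementation would use jtar (or Hadoop's codec machinery) instead of this hand-rolled parser; the class and helper names below are illustrative only, not from any Hadoop or jtar API.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.zip.*;

public class TarGzSketch {
    // Build a single-entry classic tar archive in memory, just for the demo.
    static byte[] makeTar(String name, byte[] data) throws IOException {
        byte[] header = new byte[512];
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(nameBytes, 0, header, 0, nameBytes.length);
        putOctal(header, 100, 8, 0644);            // mode
        putOctal(header, 108, 8, 0);               // uid
        putOctal(header, 116, 8, 0);               // gid
        putOctal(header, 124, 12, data.length);    // size
        putOctal(header, 136, 12, 0);              // mtime
        Arrays.fill(header, 148, 156, (byte) ' '); // checksum field counts as spaces
        header[156] = '0';                         // type flag: regular file
        long sum = 0;
        for (byte b : header) sum += b & 0xff;
        putOctal(header, 148, 8, sum);             // fill in the real checksum
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(header);
        out.write(data);
        out.write(new byte[(512 - data.length % 512) % 512]); // pad to block boundary
        out.write(new byte[1024]);                 // two zero blocks end the archive
        return out.toByteArray();
    }

    static void putOctal(byte[] buf, int off, int len, long val) {
        String s = Long.toOctalString(val);
        while (s.length() < len - 1) s = "0" + s;
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(b, 0, buf, off, len - 1);
        buf[off + len - 1] = 0;
    }

    // Read every entry (name -> contents) out of a tar.gz stream. This is the
    // loop a record reader would run, emitting one entry per map input record.
    static Map<String, byte[]> readTarGz(InputStream gz) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(new GZIPInputStream(gz));
        byte[] header = new byte[512];
        while (true) {
            in.readFully(header);
            if (header[0] == 0) break;             // zero block: end of archive
            String name = new String(header, 0, 100, StandardCharsets.US_ASCII).trim();
            int size = Integer.parseInt(
                new String(header, 124, 11, StandardCharsets.US_ASCII).trim(), 8);
            byte[] data = new byte[size];
            in.readFully(data);
            in.skipBytes((512 - size % 512) % 512); // skip padding
            files.put(name, data);
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        byte[] tar = makeTar("hello.txt", "hello world".getBytes(StandardCharsets.US_ASCII));
        ByteArrayOutputStream gzBytes = new ByteArrayOutputStream();
        try (GZIPOutputStream g = new GZIPOutputStream(gzBytes)) { g.write(tar); }
        Map<String, byte[]> files = readTarGz(new ByteArrayInputStream(gzBytes.toByteArray()));
        System.out.println(files.size());
        System.out.println(new String(files.get("hello.txt"), StandardCharsets.US_ASCII));
    }
}
```

Note the sketch skips checksum verification and long-name extensions; jtar's TarInputStream handles those details, which is why the original suggestion points at the library rather than hand-parsing.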


On Mon, May 9, 2011 at 2:48 AM, Eric <eric.xkcd@gmail.com> wrote:
> Hi,
> I have a job that processes raw data inside tarballs. As job input I have a
> text file listing the full HDFS paths of the files that need to be processed,
> e.g.:
> ...
> /user/eric/file451.tar.gz
> /user/eric/file452.tar.gz
> /user/eric/file453.tar.gz
> ...
> Each mapper gets one line of input at a time, moves the tarball to local
> storage, unpacks it and processes all files inside.
> This works very well. However: chances are high that a mapper gets to
> process a file that is not stored locally on that node, so it needs to be
> transferred.
> My question: is there any way to get better locality in a job as described
> above?
> Best regards,
> Eric

Joseph Echeverria
Cloudera, Inc.
