hadoop-mapreduce-user mailing list archives

From Eric <eric.x...@gmail.com>
Subject Improve data locality for MR job processing tar.gz files
Date Mon, 09 May 2011 09:48:10 GMT
Hi,

I have a job that processes raw data inside tarballs. As job input I have a
text file listing the full HDFS path of the files that need to be processed,
e.g.:
...
/user/eric/file451.tar.gz
/user/eric/file452.tar.gz
/user/eric/file453.tar.gz
...

Each mapper gets one line of input at a time, moves the tarball to local
storage, unpacks it and processes all files inside.
This works very well. However, chances are high that a mapper is assigned
a file that is not stored on its own node, so the file first has to be
transferred from another node.
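
For illustration, the per-line step described above (take one tarball and
process every file inside it) might look roughly like the sketch below. It
is plain Python rather than the actual Hadoop job, it assumes the tarball
has already been copied to local disk, and the names process_tarball and
handle_member are hypothetical.

```python
import tarfile


def handle_member(name, data):
    """Hypothetical per-file processing: here it only reports the size."""
    return name, len(data)


def process_tarball(local_path, handle=handle_member):
    """Read a local .tar.gz and run `handle` on every regular file inside."""
    results = []
    with tarfile.open(local_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, symlinks, and other non-files
            with archive.extractfile(member) as stream:
                results.append(handle(member.name, stream.read()))
    return results
```

The cost that matters for locality is not in this step but in the copy
that precedes it: the tarball has to reach the local disk before
process_tarball can be called on it.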

My question: is there any way to get better locality in a job as described
above?

Best regards,
Eric
