hadoop-mapreduce-user mailing list archives

From Eric <eric.x...@gmail.com>
Subject Improve data locality for MR job processing tar.gz files
Date Mon, 09 May 2011 09:48:10 GMT

I have a job that processes raw data inside tarballs. As job input I have a
text file listing the full HDFS path of the files that need to be processed,

Each mapper gets one line of input at a time, moves the tarball to local
storage, unpacks it and processes all files inside.
This works very well. However, chances are high that a mapper gets to
process a file that is not stored locally on that node so it needs to be

My question: is there any way to get better locality in a job as described

Best regards,
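
Editorial sketch (not part of the original message): with the setup described above, the job input is the text file of paths, so the scheduler only sees the block locations of that list file, not of the tarballs it names. One possible direction, offered as an assumption rather than a confirmed answer, is a custom InputFormat that emits one split per tarball and returns the tarball's hosts from InputSplit.getLocations(). The helper below is hypothetical (LocalityHint and preferredHosts are illustrative names, not Hadoop APIs). It only shows the ranking step: given the host list of each block, which could be gathered with FileSystem.getFileBlockLocations() and BlockLocation.getHosts(), it picks the hosts that store the most blocks of the file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: ranks hosts by how many
// blocks of a single file they store.
final class LocalityHint {
    private LocalityHint() {}

    /**
     * @param blockHosts one entry per block of the file; each entry lists
     *                   the hosts holding a replica of that block
     * @param maxHosts   maximum number of host names to return
     * @return hosts ordered by block count (descending), ties broken
     *         alphabetically, truncated to maxHosts entries
     */
    static List<String> preferredHosts(List<List<String>> blockHosts, int maxHosts) {
        // Count how many blocks each host stores.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> hosts : blockHosts) {
            for (String host : hosts) {
                counts.merge(host, 1, Integer::sum);
            }
        }

        // Sort by count descending, then by name for a stable result.
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });

        int limit = Math.min(Math.max(maxHosts, 0), ranked.size());
        return new ArrayList<>(ranked.subList(0, limit));
    }
}
```

The returned names could then be passed as the hosts array of a per-tarball split, so that the scheduler at least has a chance of placing the mapper on a node that already stores most of the tarball. Whether this helps in practice depends on tarball size relative to the block size and on cluster load.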
