hadoop-common-user mailing list archives

From John Heidemann <jo...@isi.edu>
Subject extracting input to a task from a (streaming) job?
Date Thu, 07 Aug 2008 16:30:19 GMT

I have a large Hadoop streaming job that generally works fine,
but a few (2-4) of the ~3000 maps and reduces have problems.
To make matters worse, the problems are system-dependent (we run on a
cluster with machines of slightly different OS versions).
I'd of course like to debug these problems, but they are embedded in a
large job.

Is there a way to extract the input given to a reducer from a job, given
the task identity?  (This would also be helpful for mappers.)

This is clearly technically *possible*, since Hadoop can rerun the tasks
if they fail.  But is there an external program that actually does it?
Or are there instructions for poking around on the compute nodes' local
disks to assemble it by hand?  Or better suggestions?
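(For what it's worth, here is a sketch of the one route I'm aware of, assuming a Hadoop release that honors the `keep.failed.task.files` property and ships the `org.apache.hadoop.mapred.IsolationRunner` class; the paths, job/task IDs, and mapper/reducer names below are placeholders, not our actual setup:)

```shell
# Run the streaming job with failed-task files kept on the compute
# nodes' local disks, so a failed attempt's input is preserved:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -jobconf keep.failed.task.files=true \
    -mapper ./my-mapper -reducer ./my-reducer \
    -input in -output out

# Then, on the node where the failing attempt ran, cd into that
# attempt's working directory under mapred.local.dir, e.g.:
#   ${mapred.local.dir}/taskTracker/jobcache/<job-id>/<attempt-id>/work
cd /path/to/mapred/local/taskTracker/jobcache/job_XXXX/attempt_XXXX/work

# Re-run just that one task in place against the preserved input:
hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
```

But this reruns the task in place rather than extracting its input as a file I can carry off and replay under a debugger, so better suggestions are welcome.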

It would be a real boon for people developing map and reduce user code.

Thanks for any pointers.
   -John Heidemann
