hadoop-common-dev mailing list archives

From "Bryan Pendleton (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-223) Should be able to re-run jobs, collecting only missing output
Date Tue, 16 May 2006 18:11:05 GMT
Should be able to re-run jobs, collecting only missing output

         Key: HADOOP-223
         URL: http://issues.apache.org/jira/browse/HADOOP-223
     Project: Hadoop
        Type: New Feature

  Components: mapred  
    Reporter: Bryan Pendleton
    Priority: Minor

For jobs with no side effects (roughly, jobs that are safe to run with speculative execution
enabled), if partial output has been generated, it should be possible to re-run the job and
fill in the missing pieces. I have now run the same job twice, once finishing 42 of 44 reduce
tasks and once finishing only 17. Each time, many nodes failed, causing a large number of task
failures (in one case, 5k failures from 15k map tasks and 23 failures from 44 reduces), but
some valid output was generated. Since the output depends only on the input, and both jobs
used the same input, I can now combine the outputs of these two failed runs to get a completed
job's output. This should be more automatic.
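
The manual combination described above amounts to taking the union of the reduce partitions
each run completed and checking what is still missing. A minimal sketch in plain Java, assuming
each run's completed partitions are known as a set of partition indices; the class and method
names are hypothetical and are not part of Hadoop:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not a Hadoop API: works out which reduce partitions
// are still missing after combining the outputs of two partial runs.
final class PartialOutputMerge {

    // Returns the partition indices in [0, numReduces) that neither run completed.
    static Set<Integer> missingPartitions(int numReduces,
                                          Set<Integer> completedInRunA,
                                          Set<Integer> completedInRunB) {
        Set<Integer> missing = new TreeSet<>();
        for (int p = 0; p < numReduces; p++) {
            if (!completedInRunA.contains(p) && !completedInRunB.contains(p)) {
                missing.add(p);
            }
        }
        return missing;
    }
}
```

An empty result means the two runs together cover every partition, which is the situation
described above. A non-empty result is the list of partitions a re-run would still have to
produce.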

In particular, it should be possible to resubmit a job, with a list of partitions that should
be ignored. A special Combiner, or pre-Combiner, would throw out any map output for partitions
that have already been successfully completed, thus reducing the amount of data that needs
to be reduced to complete the job. It would, of course, be nice to support "filling in" existing
outputs, rather than having to do a move operation on completed outputs.
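
The proposed pre-Combiner filter could look roughly like the sketch below. It is written in
plain Java rather than against the mapred interfaces, and everything in it is an assumption
for illustration: the class name, the string keys, and the hash-modulo partitioning rule. A
real implementation would have to use the same partitioner as the original job, so that a
record is assigned to the same partition on the re-run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not a Hadoop API: drops map output records whose
// partition already has completed reduce output from an earlier run.
final class CompletedPartitionFilter {
    private final int numPartitions;
    private final Set<Integer> completedPartitions;

    CompletedPartitionFilter(int numPartitions, Set<Integer> completedPartitions) {
        this.numPartitions = numPartitions;
        this.completedPartitions = completedPartitions;
    }

    // Assumed partitioning rule: non-negative hash of the key, modulo the
    // number of partitions. It must match the original job's partitioner.
    int partitionOf(String key) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Keeps only the records whose partition still needs to be reduced.
    <V> List<Map.Entry<String, V>> dropCompleted(List<Map.Entry<String, V>> mapOutput) {
        List<Map.Entry<String, V>> kept = new ArrayList<>();
        for (Map.Entry<String, V> record : mapOutput) {
            if (!completedPartitions.contains(partitionOf(record.getKey()))) {
                kept.add(record);
            }
        }
        return kept;
    }
}
```

Filtering before the combine and shuffle steps is what saves work here: records bound for
already-completed partitions are never sorted, transferred, or reduced again.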

