hadoop-mapreduce-issues mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-460) Should be able to re-run jobs, collecting only missing output
Date Fri, 18 Jul 2014 04:57:05 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066019#comment-14066019 ]

Allen Wittenauer commented on MAPREDUCE-460:

So is there actually another JIRA that is covering this and it just isn't linked to this one?
Is there really any reason to keep this and MAPREDUCE-443 open?

> Should be able to re-run jobs, collecting only missing output
> -------------------------------------------------------------
>                 Key: MAPREDUCE-460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-460
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv2
>            Reporter: Bryan Pendleton
> For jobs with no side effects (roughly == jobs with speculative execution enabled), if
> partial output has been generated, it should be possible to re-run the job, and fill in the
> missing pieces. I have now run the same job twice, once finishing 42 of 44 reduce tasks, another
> time finishing only 17. Each time, many nodes have failed, causing many many tasks to fail
> (in one case, 5k failures from 15k map tasks, 23 failures from 44 reduces), but some valid
> output was generated. Since the output is only dependent on the input, and both jobs used
> the same input, I will now be able to combine these two failed task outputs to get a completed
> job's output. This should be something that can be more automatic.
> In particular, it should be possible to resubmit a job, with a list of partitions that
> should be ignored. A special Combiner, or pre-Combiner, would throw out any map output for
> partitions that have already been successfully completed, thus reducing the amount of data
> that needs to be reduced to complete the job. It would, of course, be nice to support "filling
> in" existing outputs, rather than having to do a move operation on completed outputs.
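
As a rough illustration of the idea in the description above, the following plain-Java sketch computes which partitions are still missing after several partial runs and then drops map output keys that belong to already completed partitions. It is only a sketch under stated assumptions: the class and method names (PartitionSkipSketch, missingPartitions, dropCompleted) are hypothetical and not part of Hadoop, the job is assumed to use hash partitioning with the same number of reduces on every run, and partitionFor mirrors the formula used by Hadoop's default HashPartitioner rather than calling it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; none of these names are Hadoop APIs.
final class PartitionSkipSketch {

    // Assumption: hash partitioning, mirroring the default HashPartitioner formula
    // (hashCode & Integer.MAX_VALUE) % numReduceTasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Partitions in [0, numReduceTasks) that no earlier run completed.
    static Set<Integer> missingPartitions(int numReduceTasks, List<Set<Integer>> completedPerRun) {
        Set<Integer> missing = new TreeSet<>();
        for (int p = 0; p < numReduceTasks; p++) {
            missing.add(p);
        }
        for (Set<Integer> done : completedPerRun) {
            missing.removeAll(done);
        }
        return missing;
    }

    // "Pre-Combiner" step: keep only keys whose partition still needs a reduce.
    static <K> List<K> dropCompleted(List<K> mapOutputKeys, Set<Integer> missing, int numReduceTasks) {
        List<K> kept = new ArrayList<>();
        for (K key : mapOutputKeys) {
            if (missing.contains(partitionFor(key, numReduceTasks))) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        int reduces = 4;
        // Completed reduce partitions from two earlier, partially failed runs.
        Set<Integer> runA = Set.of(0, 1);
        Set<Integer> runB = Set.of(1, 3);
        Set<Integer> missing = missingPartitions(reduces, List.of(runA, runB));
        System.out.println("still missing: " + missing);
        System.out.println("keys kept: " + dropCompleted(List.of(1, 2, 5, 6, 10), missing, reduces));
    }
}
```

In a real job the filtering would have to happen on the map side before the shuffle, and the completed-partition list would have to be passed in with the resubmitted job; how to do either is left open here, as it is in the description.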

This message was sent by Atlassian JIRA
