hadoop-mapreduce-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
Date Wed, 28 Nov 2012 19:39:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505831#comment-13505831 ]

Bikas Saha commented on MAPREDUCE-4819:

Yeah. Got the same info from Vinod in an offline conversation.

It looks like the patch solves half of the problem: it makes sure that history is fully saved
before the job changes to the succeeded state.
The other half is to make sure the recovery data is available to the restarted app.
Since the RM can restart FAILED/KILLED/SUCCEEDED apps, it looks like we will need to wait for
state data to be saved for all of them and not just for the succeeded state (which is what the
patch does). Otherwise, the RM could restart a failed app, which would run again and fail again.

The solutions to the second half could be:
1) Don't delete the original in the staging dirs. But this suffers from the problem that the
final staging dir cleanup would end up cleaning it for a successful app, and then the AM could crash.
2) Have the recovery service look at both the temp and done locations. But this suffers from race
conditions when the AM does a partial move to the done dir and then dies, so part of the data
is in temp and part in done.
3) Before moving from temp to done, create a marker file in done. Upon restart, check whether the
marker file exists. If it does, then don't do anything, because the job was done
(failed/killed/successful) and the AM died sometime after that.
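
To make option 3 concrete, here is a minimal sketch of the marker-file idea. It is not the
patch and not Hadoop code: it uses the local filesystem instead of HDFS, and the marker name
`_JOB_FINISHED` and the functions `finalize_job`, `recorded_final_state` and `should_rerun`
are hypothetical names chosen for illustration.

```python
import os
import shutil

# Hypothetical marker file name; not taken from the patch.
MARKER_NAME = "_JOB_FINISHED"


def finalize_job(temp_dir, done_dir, final_state):
    """Record the final state in done_dir, then move the files out of temp_dir."""
    os.makedirs(done_dir, exist_ok=True)
    marker = os.path.join(done_dir, MARKER_NAME)
    # Step 1: write the marker BEFORE moving anything, so a crash during the
    # move still leaves evidence that the job reached a final state.
    with open(marker, "w") as f:
        f.write(final_state)
        f.flush()
        os.fsync(f.fileno())
    # Step 2: move the files. A crash here can leave data split between
    # temp_dir and done_dir, but the marker already exists.
    for name in sorted(os.listdir(temp_dir)):
        shutil.move(os.path.join(temp_dir, name), os.path.join(done_dir, name))


def recorded_final_state(done_dir):
    """Return the final state stored in the marker, or None if no marker exists."""
    marker = os.path.join(done_dir, MARKER_NAME)
    if not os.path.exists(marker):
        return None
    with open(marker) as f:
        return f.read()


def should_rerun(done_dir):
    """A restarted attempt reruns the job only if no marker was written."""
    return recorded_final_state(done_dir) is None
```

Writing the marker first means a restarted attempt never has to reason about a partially moved
set of files, which is the race described in option 2. The sketch writes the marker for every
final state (failed, killed, or successful), not just for success.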

> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch
> If the AM reports final job status to the client but then crashes before unregistering
> with the RM then the RM can run another AM attempt.  Currently AM re-attempts assume that
> the previous attempts did not reach a final job state, and that causes the job to rerun (from
> scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is
> bad for a number of reasons.  If the job failed, it's confusing at best since the client was
> already told the job failed but the subsequent attempt could succeed.  If the job succeeded
> there could be data loss, as a subsequent job launched by the client tries to consume the
> job's output as input just as the re-attempt starts removing output files in preparation for
> the output commit.

