hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
Date Tue, 27 Nov 2012 14:43:58 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504646#comment-13504646 ]

Jason Lowe commented on MAPREDUCE-4819:

bq. Maybe final client notification should be the last thing after all post processing is

No, moving the client notification later just creates a different set of problems, like the
client never being notified *at all* because the AM crashes after unregistering with the RM
but before it notifies the client.  The RM won't restart the app because it unregistered successfully,
but the client is never notified.

bq. In general it seems like we need to come up with a set of markers that previous AM's leave
behind that can tell the next retry if the previous one failed/succeeded and so the current
AM should exit or continue to run.

Exactly, and the AM is already doing this in the job history file, which is how it helps support
recovery.  We should extend this so that even if the output committer doesn't support recovery
the AM will check for markers in the job history file and prevent the job from executing tasks
and committing output if final job status has been determined by previous attempts.  That
way we prevent the AM from re-committing job output or changing the final job status after
notifying the client.  We just need to make sure those markers are flushed to persistent storage
before the AM attempts to notify the client, and that future AM attempts can locate them.  If
subsequent attempts see the final job status marker, they should skip straight to the client
notification process instead of running tasks.
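
As a rough illustration of the marker idea, here is a minimal Java sketch.  It is not the actual
AM code: the class FinalStatusMarker, the FINAL_JOB_STATUS= line format, and the use of a plain
local file in place of the real job history file are all assumptions made for the example.  It
shows an attempt syncing the final status to disk before any client notification, and a later
attempt checking for that marker on startup to decide whether to run tasks at all.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Optional;

/** Hypothetical sketch of a final-status marker; not the real MR AM implementation. */
public class FinalStatusMarker {
    // Assumed marker format; the real job history file uses its own event encoding.
    public static final String PREFIX = "FINAL_JOB_STATUS=";

    public enum Action { RUN_TASKS, NOTIFY_CLIENT_ONLY }

    /** Appends the marker with SYNC so it reaches storage before the caller notifies the client. */
    public static void recordFinalStatus(Path history, String status) throws IOException {
        byte[] line = (PREFIX + status + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        Files.write(history, line,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND, StandardOpenOption.SYNC);
    }

    /** Returns the first final status recorded by any previous attempt, if there is one. */
    public static Optional<String> readFinalStatus(Path history) throws IOException {
        if (!Files.exists(history)) {
            return Optional.empty();
        }
        for (String line : Files.readAllLines(history, StandardCharsets.UTF_8)) {
            if (line.startsWith(PREFIX)) {
                // The first marker wins: a later attempt must not change the reported status.
                return Optional.of(line.substring(PREFIX.length()));
            }
        }
        return Optional.empty();
    }

    /** Startup decision for a new attempt: skip tasks and commit if a final status exists. */
    public static Action decideOnStartup(Path history) throws IOException {
        return readFinalStatus(history).isPresent() ? Action.NOTIFY_CLIENT_ONLY : Action.RUN_TASKS;
    }
}
```

With this shape, an attempt would call recordFinalStatus() and only then notify the client, so a
crash between notification and unregistering with the RM leaves a marker that the next attempt
finds via decideOnStartup().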

> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
> If the AM reports final job status to the client but then crashes before unregistering
with the RM then the RM can run another AM attempt.  Currently AM re-attempts assume that
the previous attempts did not reach a final job state, and that causes the job to rerun (from
scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is
bad for a number of reasons.  If the job failed, it's confusing at best since the client was
already told the job failed but the subsequent attempt could succeed.  If the job succeeded
there could be data loss, as a subsequent job launched by the client tries to consume the
job's output as input just as the re-attempt starts removing output files in preparation for
the output commit.
