hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
Date Thu, 29 Nov 2012 15:42:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506544#comment-13506544 ]

Jason Lowe commented on MAPREDUCE-4819:

Again, to me it's all about the commit.  If we address that, then I don't think the others
are all that critical, since the commit occurs first and is the one step that is crucial not
to repeat.  The others should be safe to repeat if necessary.

Once we checkpoint the fact that we committed, the rest can be recovered in a relatively straightforward
manner on subsequent attempts with the existing code -- we just skip past the commit and proceed
with what we're already doing: setting final job status, performing job end notification,
unregistering, etc.  Job end notification is already a best-effort-but-not-guaranteed service,
and we can't avoid the potential for double notifications.
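
To make the idea concrete, here is a minimal sketch of the checkpoint-then-skip flow. It is not Hadoop code and not the attached patch: the marker file name `commit_done`, the `run_attempt` and `demo` functions, and the step names are all hypothetical, and a local directory stands in for wherever the AM would persist the checkpoint. It also does not handle a crash between the commit itself and writing the marker.

```python
import os
import tempfile

COMMIT_MARKER = "commit_done"  # hypothetical marker file name


def run_attempt(state_dir, commit_output, finish_steps):
    """Run one AM attempt; return the names of the actions it performed."""
    performed = []
    marker = os.path.join(state_dir, COMMIT_MARKER)

    if os.path.exists(marker):
        # A previous attempt already committed: skip past the commit.
        performed.append("skip_commit")
    else:
        commit_output()
        # Checkpoint the fact that we committed before doing anything else.
        with open(marker, "w") as f:
            f.write("committed\n")
        performed.append("commit")

    # Repeatable steps: final status, job end notification, unregister.
    for name, step in finish_steps:
        step()
        performed.append(name)
    return performed


def demo():
    """Simulate an attempt that crashes before unregistering, then a retry."""
    commits = []
    notifications = []

    def crash():
        raise RuntimeError("simulated AM crash before unregistering")

    with tempfile.TemporaryDirectory() as state_dir:
        first = [
            ("set_final_status", lambda: None),
            ("notify", lambda: notifications.append("done")),
            ("unregister", crash),
        ]
        try:
            run_attempt(state_dir, lambda: commits.append("commit"), first)
        except RuntimeError:
            pass  # the RM would now launch another attempt

        second = [
            ("set_final_status", lambda: None),
            ("notify", lambda: notifications.append("done")),
            ("unregister", lambda: None),
        ]
        performed = run_attempt(state_dir, lambda: commits.append("commit"), second)
    return commits, notifications, performed
```

In this sketch the second attempt performs no commit, but the notification fires in both attempts, which matches the point above that double notifications can't be avoided.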

If we think it's important to delay reporting job success via the job status RPC call until
after the history file is copied to done_intermediate (which I don't see being so, since the
commit can still be repeated), then we can do that in another JIRA or in this one.  However,
this one would still need to be fixed and is a very high priority.

> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch
>
> If the AM reports final job status to the client but then crashes before unregistering
> with the RM then the RM can run another AM attempt.  Currently AM re-attempts assume that
> the previous attempts did not reach a final job state, and that causes the job to rerun (from
> scratch, if the output format doesn't support recovery).
>
> Re-running the job when we've already told the client the final status of the job is
> bad for a number of reasons.  If the job failed, it's confusing at best since the client was
> already told the job failed but the subsequent attempt could succeed.  If the job succeeded
> there could be data loss, as a subsequent job launched by the client tries to consume the
> job's output as input just as the re-attempt starts removing output files in preparation for
> the output commit.

