hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4157) ResourceManager should not kill apps that are well behaved
Date Tue, 19 Jun 2012 20:12:43 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397033#comment-13397033 ]

Jason Lowe commented on MAPREDUCE-4157:

It's not part of MAPREDUCE-3614; rather, it came out of the work on MAPREDUCE-4099.  When the
AM unregisters with the RM, there's a race between the AM finishing normally on its own and
the RM killing the AM as part of killing all containers for the application.  If the AM is
performing cleanup duties that aren't critical to the success/failure of the application, then
it would be nice if the AM were given time to do this before the RM kills it as a side effect
of the unregister.

The AM could move the cleanup to before the unregister, but if the AM fails, dies, or hangs during
the cleanup, the RM will attempt to restart the AM, thinking the job did not complete successfully
even though the client has already been notified of the success.  And if the staging directory
was removed as part of the cleanup, the restart will fail and the RM will mark the job as
failed even though the client thought it succeeded.

This change doesn't eliminate all of the race conditions (the AM could fail after the client
is notified but before unregistering with the RM), but it does eliminate a race between the
AM shutting down cleanly and the RM trying to kill it.
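
To illustrate the behavior described above, here is a minimal Java sketch of a grace period on unregister. It is not the attached patch and uses no YARN classes: AmContainer, onUnregister, and the graceMillis parameter are hypothetical names chosen for illustration. The idea is only that the RM side waits a bounded time for the AM to exit on its own and issues the kill only if that time runs out.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; these are hypothetical types, not YARN classes.
class GracePeriodSketch {

    /** Hypothetical stand-in for the AM's container as seen by the RM. */
    static final class AmContainer {
        private final CountDownLatch exited = new CountDownLatch(1);
        private volatile boolean killed = false;

        /** Called when the AM finishes its cleanup and exits on its own. */
        void exitCleanly() {
            exited.countDown();
        }

        /** Called by the RM (via the NM in the real system) to force an exit. */
        void kill() {
            killed = true;
            exited.countDown();
        }

        boolean wasKilled() {
            return killed;
        }

        boolean awaitExit(long timeout, TimeUnit unit) throws InterruptedException {
            return exited.await(timeout, unit);
        }
    }

    /**
     * Handles an unregister: waits up to graceMillis for the AM to exit on
     * its own, and kills it only if the grace period expires.
     *
     * @return true if the AM exited cleanly, false if it had to be killed
     */
    static boolean onUnregister(AmContainer am, long graceMillis) {
        boolean exitedInTime;
        try {
            exitedInTime = am.awaitExit(graceMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt and fall through to the kill path.
            Thread.currentThread().interrupt();
            exitedInTime = false;
        }
        if (!exitedInTime) {
            am.kill();
        }
        return exitedInTime;
    }

    public static void main(String[] args) {
        // A well-behaved AM that has already finished its cleanup.
        AmContainer wellBehaved = new AmContainer();
        wellBehaved.exitCleanly();
        System.out.println("well-behaved exited cleanly: " + onUnregister(wellBehaved, 1000));

        // An AM that hangs during cleanup and never exits on its own.
        AmContainer hung = new AmContainer();
        System.out.println("hung exited cleanly: " + onUnregister(hung, 50));
    }
}
```

A bounded wait keeps the RM from waiting forever on an AM that hangs during cleanup, while a well-behaved AM is never killed. As noted above, this does not address an AM that fails after notifying the client but before unregistering.
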
> ResourceManager should not kill apps that are well behaved
> ----------------------------------------------------------
>                 Key: MAPREDUCE-4157
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4157
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>    Affects Versions: 2.0.0-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: MAPREDUCE-4157.patch
> Currently when the ApplicationMaster unregisters with the ResourceManager, the RM kills
> (via the NMs) all the active containers for an application.  This introduces a race where
> the AM may be trying to clean up and may not finish before it is killed.  The RM should give
> the AM a chance to exit cleanly on its own rather than always race with a pending kill on
