hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
Date Tue, 10 Sep 2013 01:43:53 GMT

    [ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762599#comment-13762599 ]

Bikas Saha commented on YARN-540:
---------------------------------

bq. Exception because unManagedAM attempt will be immediately removed from the responseMap
Haven't looked at the patch yet, but this sounds like a race condition waiting to happen in
other cases. Let's say the first unregister returns false. Now someone kills the app, and the
app goes through the transition that removes it from the responseMap. If the AM then comes
back with a second unregister, should that call fail or succeed?

The key question here is whether an AM is done after it calls unregister. If the unregister
fails, is the AM expected to consider itself failed, or to continue as if it had succeeded?
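To make the sequence above concrete, here is a minimal, hypothetical Java model of it. It is not the ResourceManager's actual code: the class `UnregisterRaceSketch`, the `Result` values, and the `persistFinalState`/`kill` methods are illustrative assumptions, and a plain map stands in for the responseMap. An unregister reports `PENDING` (the "returns false" case) until the attempt's final state is assumed to be persisted; a kill in between removes the attempt, so the AM's retry finds nothing.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical model of the unregister/kill race; not ResourceManager code.
 * All names here are illustrative assumptions.
 */
public class UnregisterRaceSketch {

    /** Outcome of an unregister call as seen by the AM (assumed values). */
    enum Result { UNREGISTERED, PENDING, UNKNOWN_ATTEMPT }

    // Stand-in for the RM's responseMap, keyed by attempt id.
    private final Map<String, Boolean> responseMap = new ConcurrentHashMap<>();
    // Attempts whose final state is assumed to be saved in the state store.
    private final Map<String, Boolean> persisted = new ConcurrentHashMap<>();

    void register(String attemptId) {
        responseMap.put(attemptId, Boolean.TRUE);
    }

    /** Marks the attempt's final state as saved, as a state store would. */
    void persistFinalState(String attemptId) {
        persisted.put(attemptId, Boolean.TRUE);
    }

    /** Simulates a kill: the app transition removes the attempt from the map. */
    void kill(String attemptId) {
        responseMap.remove(attemptId);
    }

    Result unregister(String attemptId) {
        if (!responseMap.containsKey(attemptId)) {
            // The ambiguous case: success or failure from the AM's point of view?
            return Result.UNKNOWN_ATTEMPT;
        }
        if (!persisted.containsKey(attemptId)) {
            // "Unregister returns false": the AM is expected to retry.
            return Result.PENDING;
        }
        responseMap.remove(attemptId);
        return Result.UNREGISTERED;
    }

    public static void main(String[] args) {
        UnregisterRaceSketch rm = new UnregisterRaceSketch();
        rm.register("attempt_1");
        System.out.println(rm.unregister("attempt_1")); // PENDING
        rm.kill("attempt_1");                           // kill races with the retry
        System.out.println(rm.unregister("attempt_1")); // UNKNOWN_ATTEMPT
    }
}
```

The sketch deliberately leaves `UNKNOWN_ATTEMPT` uninterpreted: whether the AM should treat it as success or failure is exactly the open question.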
                
> Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.patch, YARN-540.patch
>
>
> When a job succeeds and successfully calls finishApplicationMaster, the RM may shut down before its dispatcher
> processes the REMOVE_APP event. The next time the RM comes back, it reloads the existing state files even
> though the job has already succeeded.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
