hadoop-yarn-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
Date Sun, 25 Aug 2013 20:15:52 GMT

    [ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749734#comment-13749734 ]

Jian He commented on YARN-540:

bq. What will happen if the RM failed after deleting the app from the store but before the
app pulled that information from the RM? 
The app will not fail, because the RM unregister call ignores any exceptions coming from finishApp().
JobClient can also get the final status of the app regardless of whether finishApp() fails or not.
bq. The state transitions are asynchronous. We cannot expect to always find the app in the
FINISHING state is the only state, after the unregister call happens, in which we can reliably say
the app has been removed from the state store, given the currently implemented state transitions.
Tell me if I missed something.
bq. Can the application finish on the RM (in between 2 finishApp() requests) such that it
never gets a true response?
The application will not go to the FINISHED state unless the AM process exits or the AM expires.
So I think it can reliably get the true response as long as the RM is available.
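
To make the retry behaviour discussed here concrete, the following is a minimal sketch, not code from the patch. It assumes the AM keeps calling finishApp() until the RM answers true, and that true means the app has been removed from the state store. The class name UnregisterRetry, the method unregisterUntilAcknowledged, and the BooleanSupplier stand-in for the RPC are all hypothetical.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: none of these names come from the YARN code base.
final class UnregisterRetry {
    private UnregisterRetry() {}

    /**
     * Calls finishApp until it returns true or maxAttempts is exhausted.
     * finishApp stands in for the AM's unregister RPC; a true result is
     * assumed to mean the RM has removed the app from its state store.
     *
     * @return the 1-based attempt that succeeded, or -1 if none did
     */
    static int unregisterUntilAcknowledged(BooleanSupplier finishApp,
                                           int maxAttempts,
                                           long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (finishApp.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and give up.
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

Because the app stays in FINISHING until the AM process exits or expires, a loop of this shape should eventually see a true response while the RM is reachable; the attempt limit only bounds the wait when it is not.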
bq. Is this possible to avoid 2 round trips to store?
Are you asking whether the following code can handle duplicate APP_REMOVE events?
bq. There is no need for multiple code paths/transitions.
I did in fact notice this while writing the patch; the intention was to avoid an unnecessary
round trip to RMStateStore. Thoughts?
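
On the earlier question about duplicate APP_REMOVE events, one way to make them harmless is to treat removal as idempotent. The sketch below uses a hypothetical in-memory store (InMemoryAppStore is not RMStateStore and does not reflect its API); it only illustrates that a second remove for the same app can be a no-op rather than an error.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a state store; not the real RMStateStore.
final class InMemoryAppStore {
    private final Map<String, String> apps = new ConcurrentHashMap<>();

    /** Records (or overwrites) the stored state for an application. */
    void storeApp(String appId, String state) {
        apps.put(appId, state);
    }

    /**
     * Removes the application if present.
     *
     * @return true if this call removed it, false if it was already absent,
     *         so a duplicate remove event is a harmless no-op
     */
    boolean removeApp(String appId) {
        return apps.remove(appId) != null;
    }

    /** Reports whether the application is still stored. */
    boolean contains(String appId) {
        return apps.containsKey(appId);
    }
}
```

With this shape a single transition could handle both the first and any repeated remove event, at the cost of one extra store call for the duplicate.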

Agree with other comments, will post a new patch soon.

> Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
> ----------------------------------------------------------------------------------------
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>            Priority: Blocker
>         Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.patch, YARN-540.patch
> When a job succeeds and successfully calls finishApplicationMaster, and the RM then shuts down
> and restarts, the dispatcher is stopped before it can process the REMOVE_APP event. The next time
> the RM comes back, it reloads the existing state files even though the job has succeeded.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
