hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
Date Fri, 13 Sep 2013 07:07:53 GMT

    [ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13766290#comment-13766290 ]

Bikas Saha commented on YARN-540:

bq. Delete throws exception in case of not-existing
If that is the case, then why didn't this code in the previous patch cause an exception to
be thrown for a normal job? It removes an app that should already have been removed
after unregister.
+      // application completely done and remove from state store.
+      // App state may be already removed during RMAppFinishingOrRemovingTransition.
+      RMStateStore store = app.rmContext.getStateStore();
+      store.removeApplication(app);
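
To make the question concrete, here is a minimal sketch of the two delete behaviours being discussed. SketchStateStore and its methods are hypothetical stand-ins backed by an in-memory map; they are not the real RMStateStore API, and the sketch only assumes that a second removal of the same app is possible.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a state store (NOT the real RMStateStore).
class SketchStateStore {
    private final Map<String, String> apps = new HashMap<>();

    void storeApplication(String appId, String state) {
        apps.put(appId, state);
    }

    // Strict variant: a second removal of the same app throws.
    void removeStrict(String appId) {
        if (apps.remove(appId) == null) {
            throw new IllegalStateException("No stored state for " + appId);
        }
    }

    // Tolerant variant: a second removal is a no-op and reports false.
    boolean removeIfPresent(String appId) {
        return apps.remove(appId) != null;
    }

    boolean contains(String appId) {
        return apps.containsKey(appId);
    }
}

class IdempotentRemoveDemo {
    public static void main(String[] args) {
        SketchStateStore store = new SketchStateStore();
        store.storeApplication("app_1", "FINISHED");

        // First removal, e.g. after the AM unregisters.
        store.removeIfPresent("app_1");

        // Second removal, e.g. in the final transition: safe with the tolerant variant.
        store.removeIfPresent("app_1");

        // The strict variant would fail here, which is the behaviour being asked about.
        try {
            store.removeStrict("app_1");
        } catch (IllegalStateException e) {
            System.out.println("strict remove failed: " + e.getMessage());
        }
    }
}
```

If the store's delete really is strict, the second call in the patch would be expected to fail for every normally finishing job, which is why the observed absence of an exception needs an explanation.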

bq. it should not be possible to generate RMAppEventType.ATTEMPT_FAILED event at that state
Can the app crash while it's waiting to be unregistered? Will that generate an ATTEMPT_FAILED?
Can the node crash and cause an ATTEMPT_FAILED? If yes, then these would apply to the FINISHING
state also.
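
A minimal sketch of why this matters for the transition table, assuming a table-driven state machine that rejects any (state, event) pair it has no entry for. The enum names, the RUNNING -> REMOVING -> FINISHING -> FINISHED ordering, and the choice to stay in the same state on a late ATTEMPT_FAILED are all assumptions for illustration, not the behaviour of the actual patch.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical states and events, loosely named after those in this thread.
enum SketchAppState { RUNNING, REMOVING, FINISHING, FINISHED, FAILED }
enum SketchAppEvent { ATTEMPT_UNREGISTERED, APP_REMOVED, ATTEMPT_FINISHED, ATTEMPT_FAILED }

class SketchAppStateMachine {
    private final Map<SketchAppState, Map<SketchAppEvent, SketchAppState>> table =
        new EnumMap<>(SketchAppState.class);
    private SketchAppState current = SketchAppState.RUNNING;

    SketchAppStateMachine() {
        add(SketchAppState.RUNNING, SketchAppEvent.ATTEMPT_UNREGISTERED, SketchAppState.REMOVING);
        add(SketchAppState.RUNNING, SketchAppEvent.ATTEMPT_FAILED, SketchAppState.FAILED);
        add(SketchAppState.REMOVING, SketchAppEvent.APP_REMOVED, SketchAppState.FINISHING);
        add(SketchAppState.FINISHING, SketchAppEvent.ATTEMPT_FINISHED, SketchAppState.FINISHED);
        // Assumed policy: a failure that arrives after unregister does not change the state.
        add(SketchAppState.REMOVING, SketchAppEvent.ATTEMPT_FAILED, SketchAppState.REMOVING);
        add(SketchAppState.FINISHING, SketchAppEvent.ATTEMPT_FAILED, SketchAppState.FINISHING);
    }

    private void add(SketchAppState from, SketchAppEvent on, SketchAppState to) {
        table.computeIfAbsent(from,
            k -> new EnumMap<SketchAppEvent, SketchAppState>(SketchAppEvent.class)).put(on, to);
    }

    // Applies the event, or throws if the current state has no entry for it.
    SketchAppState handle(SketchAppEvent event) {
        Map<SketchAppEvent, SketchAppState> row = table.get(current);
        SketchAppState next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " at " + current);
        }
        current = next;
        return current;
    }

    SketchAppState getState() {
        return current;
    }
}
```

Without the two ATTEMPT_FAILED entries, an AM or node crash during REMOVING or FINISHING would surface as an invalid-event error in this sketch, so whichever state can see the event needs an explicit entry.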

bq. In case of REMOVING, return YARNApplicationState as RUNNING, makes sense?
In general, an app can also be removed while it's in the ACCEPTED state (kill app after submission).
Such apps should also go through the REMOVING state, so the app state is not guaranteed to
be RUNNING. We probably need to save the previous state and return that while the app
is in the REMOVING state.
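
A minimal sketch of the "save the previous state" idea. SketchApp, SketchInternalState and SketchReportedState are hypothetical names, and the mapping from internal to reported states is an assumption for illustration rather than the real RMAppImpl logic.

```java
// Hypothetical internal states of the app inside the RM.
enum SketchInternalState { ACCEPTED, RUNNING, REMOVING, FINISHED }

// Hypothetical externally reported states; REMOVING is never exposed.
enum SketchReportedState { ACCEPTED, RUNNING, FINISHED }

class SketchApp {
    private SketchInternalState state;
    // State the app was in just before it entered REMOVING.
    private SketchInternalState stateBeforeRemoving;

    SketchApp(SketchInternalState initial) {
        this.state = initial;
    }

    // Remember where we came from so it can be reported while REMOVING.
    void startRemoving() {
        stateBeforeRemoving = state;
        state = SketchInternalState.REMOVING;
    }

    void finishRemoving() {
        state = SketchInternalState.FINISHED;
    }

    SketchReportedState reportedState() {
        SketchInternalState effective =
            (state == SketchInternalState.REMOVING) ? stateBeforeRemoving : state;
        switch (effective) {
            case ACCEPTED:
                return SketchReportedState.ACCEPTED;
            case RUNNING:
                return SketchReportedState.RUNNING;
            case FINISHED:
                return SketchReportedState.FINISHED;
            default:
                throw new IllegalStateException("No reported state for " + effective);
        }
    }
}
```

With this shape, an app killed right after submission keeps reporting ACCEPTED while its stored state is being removed, instead of briefly appearing as RUNNING.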

> Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
> ----------------------------------------------------------------------------------------
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch,
> YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, YARN-540.7.patch, YARN-540.patch, YARN-540.patch
> When a job succeeds and successfully calls finishApplicationMaster, and the RM then shuts down
> and restarts, the dispatcher is stopped before it can process the REMOVE_APP event. The next time
> the RM comes back, it reloads the existing state files even though the job has succeeded.

