hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-540) RM state store not cleaned if job succeeds but RM shutdown and restart-dispatcher stopped before it can process REMOVE_APP event
Date Thu, 04 Apr 2013 19:20:16 GMT

    [ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622681#comment-13622681 ]

Bikas Saha commented on YARN-540:
---------------------------------

This is a known issue. The problem here is that the RM state store is essentially a
write-ahead log. But in the application unregister/finish case, the application has already
finished before the RM stores that fact in its state, so the RM by itself cannot avoid this
problem. Since it's a race condition, we may choose not to fix it unless we see this happen
often in practice.
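
To make the race concrete, here is a minimal sketch of it. It is a toy model, not YARN code: the class names, the in-memory store, and the removal queue are all assumptions made for illustration, and only the method name finishApplicationMaster is taken from the discussion.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical model of the race; none of these classes are YARN classes. */
public class StateStoreRaceSketch {

    /** Stand-in for the persistent RM state store (survives an RM restart). */
    static final class Store {
        final Set<String> apps = new HashSet<>();
    }

    /** Stand-in for the RM: removals are queued and applied asynchronously. */
    static final class ResourceManager {
        final Store store;
        final Queue<String> pendingRemovals = new ArrayDeque<>();

        ResourceManager(Store store) {
            this.store = store;
        }

        void submit(String appId) {
            store.apps.add(appId);
        }

        // Returns to the AM immediately; the store is only updated later.
        void finishApplicationMaster(String appId) {
            pendingRemovals.add(appId);
        }

        // The dispatcher drains the queue. If the RM stops first, this never runs.
        void drainDispatcher() {
            String appId;
            while ((appId = pendingRemovals.poll()) != null) {
                store.apps.remove(appId);
            }
        }
    }

    /** Simulates: app finishes, RM stops before the drain, a new RM reads the store. */
    public static Set<String> recoverAfterCrash() {
        Store store = new Store();
        ResourceManager rm = new ResourceManager(store);
        rm.submit("app_1");
        rm.finishApplicationMaster("app_1");
        // The RM shuts down here, so drainDispatcher() is never called.
        ResourceManager restarted = new ResourceManager(store);
        return new HashSet<>(restarted.store.apps);
    }

    public static void main(String[] args) {
        System.out.println("Recovered apps: " + recoverAfterCrash());
    }
}
```

In this model the restarted RM still sees app_1 in the store even though the application has already finished, which is the stale state described in the issue.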
The solutions that come to mind are:
1) finishApplicationMaster() blocks until the finish is stored in the store. This has the
problem of getting blocked on a slow/unavailable store. Also, the RM does a bunch of other
things before an application finishes. The RM may not be able to remove the application from
the store until all those steps are complete.
2) finishApplicationMaster() becomes a 2-step process in which, in the second step, the app
waits for the RM to change the app's state to "FINISHED" before exiting.
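
Option 2 could look roughly like the sketch below. This is only an illustration under stated assumptions: the ResourceManager stand-in, the FINISHING state, storeUpdateCompleted(), and the polling helper awaitFinished() are hypothetical names, not the YARN API, and the bounded poll count is an added assumption so the app does not wait forever on a slow store.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the 2-step finish; not the YARN API. */
public class TwoStepFinishSketch {

    enum AppState { RUNNING, FINISHING, FINISHED }

    /** Stand-in RM that reports FINISHED only after its store update is done. */
    static final class ResourceManager {
        private final Map<String, AppState> states = new HashMap<>();

        synchronized void register(String appId) {
            states.put(appId, AppState.RUNNING);
        }

        // Step 1: the app announces it is done; this call does not block on the store.
        synchronized void finishApplicationMaster(String appId) {
            states.put(appId, AppState.FINISHING);
        }

        // Invoked from the RM's own asynchronous path once the store is updated.
        synchronized void storeUpdateCompleted(String appId) {
            states.put(appId, AppState.FINISHED);
        }

        synchronized AppState getState(String appId) {
            return states.get(appId);
        }
    }

    /** Step 2: the app polls until the RM reports FINISHED, or gives up. */
    static boolean awaitFinished(ResourceManager rm, String appId,
                                 int maxPolls, long sleepMillis)
            throws InterruptedException {
        for (int i = 0; i < maxPolls; i++) {
            if (rm.getState(appId) == AppState.FINISHED) {
                return true;
            }
            Thread.sleep(sleepMillis);
        }
        return false;
    }

    /** Returns true if the app sees FINISHED only after the store update ran. */
    public static boolean runDemo() {
        try {
            ResourceManager rm = new ResourceManager();
            rm.register("app_1");
            rm.finishApplicationMaster("app_1");

            // The store has not been updated yet, so a single poll must fail.
            boolean early = awaitFinished(rm, "app_1", 1, 1);

            // Simulate the RM finishing its store update on another thread.
            Thread storeWriter = new Thread(() -> rm.storeUpdateCompleted("app_1"));
            storeWriter.start();
            boolean done = awaitFinished(rm, "app_1", 1000, 5);
            storeWriter.join();

            return !early && done;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Two-step finish observed: " + runDemo());
    }
}
```

The design point is that the blocking moves from the RM call to the app side: the RM stays responsive, and the app decides how long it is willing to wait before exiting anyway.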
                
> RM state store not cleaned if job succeeds but RM shutdown and restart-dispatcher stopped
before it can process REMOVE_APP event
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>
> When a job succeeds and successfully calls finishApplicationMaster, RM shutdown and restart-dispatcher
is stopped before it can process the REMOVE_APP event. The next time the RM comes back, it will reload
the existing state files even though the job has succeeded.

