hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
Date Thu, 05 Sep 2013 20:29:54 GMT

    [ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759412#comment-13759412 ]

Jason Lowe commented on YARN-540:

bq.  Then RM will also need to somehow remember that unregister came in but the state-store
app removal isn't done. Which is not possible without more state-store writes?

Argh, right, I forgot.  It will simply see the container exit but not understand the context
of that exit, and will misinterpret it as a crash-and-recover scenario.  Darn, I thought we had
it.  :-)

I think the existing unregister call should be blocking from the AM's perspective, as that's
the simplest and most compatible way to fix it.  We could always add an asynchronous form
of that API later.  If most AMs are expected to communicate through a wrapper layer where
we can hide this behavior then that's probably fine too -- RM and low-level API could be async
but most AMs still see it as a blocking call.
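
To illustrate the wrapper-layer idea, here is a minimal Java sketch. It is not taken from any YARN patch: AsyncUnregisterClient, unregisterAsync and BlockingUnregisterWrapper are hypothetical names, and the sketch assumes the low-level call returns a future that completes once the RM has finished removing the app from the state store.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical low-level API: returns immediately. The future is assumed to
// complete once the RM has durably removed the app from its state store.
interface AsyncUnregisterClient {
    CompletableFuture<Void> unregisterAsync(String appId);
}

// Hypothetical wrapper layer: AMs call unregister() and see a blocking call,
// even though the underlying API is asynchronous.
final class BlockingUnregisterWrapper {
    private final AsyncUnregisterClient client;

    BlockingUnregisterWrapper(AsyncUnregisterClient client) {
        this.client = client;
    }

    // Blocks until the removal is acknowledged or the timeout expires.
    // Returns true if acknowledged, false if the wait timed out.
    boolean unregister(String appId, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        try {
            client.unregisterAsync(appId).get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            return false;
        }
    }
}
```

With this shape, an AM that must not exit before the removal is persisted keeps calling unregister() until it returns true, while a caller that wants the async form can use the low-level interface directly.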

Part of the issue with making it async is that at some point we need to have some flow control.
If apps are churning faster than we can persist them then there are going to be issues (backup
of store dispatcher queue, etc.).  At some point we have to block something.
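
As a rough illustration of that flow-control point, the sketch below uses a bounded queue so that producers are forced to wait once persistence falls behind. BoundedStoreQueue and its methods are hypothetical stand-ins, not the RM's actual store dispatcher.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a store dispatcher queue with a fixed capacity.
final class BoundedStoreQueue {
    private final BlockingQueue<String> pending;

    BoundedStoreQueue(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    // Producer side: waits up to the timeout for free space.
    // A false return means the backlog is full and the caller must back off.
    boolean submit(String appId, long timeout, TimeUnit unit)
            throws InterruptedException {
        return pending.offer(appId, timeout, unit);
    }

    // Consumer side: the persisting thread takes the oldest pending app,
    // or gets null if nothing is waiting.
    String nextToPersist() {
        return pending.poll();
    }

    // Number of apps waiting to be persisted.
    int backlog() {
        return pending.size();
    }
}
```

The capacity is the point at which "something" blocks: once it is reached, submitters stall until the persisting side drains an entry, instead of the queue growing without bound.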
> Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
> ----------------------------------------------------------------------------------------
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.patch,
> When a job succeeds and successfully calls finishApplicationMaster, and the RM then shuts down
> and restarts, the dispatcher is stopped before it can process the REMOVE_APP event. The next time
> the RM comes back, it will reload the existing state files even though the job has succeeded.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
