hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
Date Thu, 05 Sep 2013 20:07:54 GMT

    [ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759384#comment-13759384 ]

Jason Lowe commented on YARN-540:

Unless I'm missing something, it does require a behavior change on the AM to recover.  Here's
the scenario:

# AM unregisters with RM, RM asynchronously schedules removal of the app from the store but
returns from the call before this completes
# RM crashes before app removed from persistent state store
# AM proceeds to clean up, remove the staging directory, and exit (i.e.: no behavior change
from what AMs do today after unregistering)
# RM restarts with the persistent state store showing the app as running (i.e.: it missed
the fact that it unregistered)
# Without work-preserving restart, the RM will try to launch a new app attempt, but the attempt
(and therefore the app) will be reported as failing because there's no staging directory.  With
work-preserving restart, it will wait up to the AM expiry interval for the original attempt
to report in and then launch a new attempt to try to recover; that attempt fails, and
ultimately so does the app.
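
The steps above can be sketched as a toy simulation. This is only an illustration of the ordering problem, not the actual RM code: `ToyRM`, `simulate`, the `pending` queue, and the `staging_dirs` set are all hypothetical names, and the only identifiers taken from the issue are `finishApplicationMaster` and `REMOVE_APP`.

```python
from collections import deque


class ToyRM:
    """Toy stand-in for the RM (hypothetical; not the real YARN classes)."""

    def __init__(self, persistent_store):
        self.store = persistent_store  # survives an RM crash
        self.pending = deque()         # in-memory dispatcher queue, lost on crash

    def finish_application_master(self, app_id):
        # Step 1: the removal is only queued; the call returns to the AM at once.
        self.pending.append(("REMOVE_APP", app_id))

    def drain_dispatcher(self):
        # Apply queued store events to the persistent store.
        while self.pending:
            event, app_id = self.pending.popleft()
            if event == "REMOVE_APP":
                self.store.discard(app_id)


def simulate(crash_before_drain):
    """Return {app_id: staging_dir_exists} for apps the restarted RM recovers."""
    store = {"app_1"}         # persisted state: app_1 is running
    staging_dirs = {"app_1"}  # staging directories currently on disk

    rm = ToyRM(store)
    rm.finish_application_master("app_1")  # step 1: AM unregisters
    if not crash_before_drain:
        rm.drain_dispatcher()              # removal reaches the store in time
    del rm                                 # step 2: RM crashes, queue is lost

    staging_dirs.discard("app_1")          # step 3: AM cleans up and exits

    restarted = ToyRM(store)               # step 4: RM reloads persisted state
    # Step 5: any recovered app gets a new attempt, which needs its staging dir.
    return {app: app in staging_dirs for app in restarted.store}
```

In this sketch, `simulate(True)` returns `{"app_1": False}`: the restarted RM still sees the app as running, but the staging directory is already gone. `simulate(False)` returns an empty dict because the removal reached the store before the crash.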

I don't see how the old AM is going to report back into the RM after unregistering without
a behavior change on the AM side.  Normally AMs clean up and leave shortly after unregistering
without trying to report back to the RM.  This change narrows the race condition window, but
the window can be larger than expected if the state store dispatcher is running behind because
of a slow store backend.
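
As a rough way to see how a slow backend widens the window, here is a back-of-the-envelope model. It assumes a single-threaded FIFO dispatcher and a uniform per-event write latency; neither assumption comes from the patch, and `exposure_window_ms` is a hypothetical helper.

```python
def exposure_window_ms(events_ahead, store_write_ms):
    """Estimate how long an unregistered app stays in the store.

    Assumes (hypothetically) one FIFO dispatcher thread and the same
    write latency for every event, including the app's own removal.
    """
    if events_ahead < 0 or store_write_ms < 0:
        raise ValueError("inputs must be non-negative")
    # Every queued event is written first, then the removal itself.
    return (events_ahead + 1) * store_write_ms
```

Under these assumptions, an idle dispatcher with 5 ms writes leaves a 5 ms window, while a backlog of 200 events at 50 ms each leaves roughly 10 seconds (10050 ms) in which an RM crash would lose the unregistration.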

> Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
> ----------------------------------------------------------------------------------------
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.patch,
> When a job succeeds and successfully calls finishApplicationMaster, and the RM then shuts down
> and restarts, the dispatcher is stopped before it can process the REMOVE_APP event. The next
> time the RM comes back, it reloads the existing state files even though the job has succeeded.
