hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
Date Wed, 11 Sep 2013 13:32:52 GMT

    [ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13764287#comment-13764287 ]

Jason Lowe commented on YARN-540:
---------------------------------

bq. The solution is to not report success to user until services have stopped.

Note that delaying reporting success to downstream consumers isn't always possible, as success
can be reported by means other than JobClient directly.  For example, the _SUCCESS file written
as part of FileOutputCommitter's commit processing indicates to others that the job succeeded.
IIRC Oozie can poll for this as part of determining whether a job succeeded.  I suspect other
committers have their own methods of notifying downstream consumers that the job succeeded.
And we shouldn't be unregistering from the RM before committing.

As such I think there will always be races where the YARN and MR app states can end up inconsistent
because a job could notify others of success and then fail before it can notify YARN.  We
may still want to delay reporting success to JobClient, but I don't think it completely solves
the issue.
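
To make the ordering problem above concrete, here is a minimal toy sketch in Python. It is not Hadoop or YARN code: ToyResourceManager, run_job, downstream_sees_success and simulate are hypothetical names invented for illustration, and a local temporary directory stands in for the job output directory. The only detail taken from the comment is that a _SUCCESS marker is written at commit time, before the AM unregisters from the RM.

```python
import os
import tempfile

SUCCESS_MARKER = "_SUCCESS"  # marker file name mentioned in the comment


class ToyResourceManager:
    """Hypothetical stand-in for the RM's view of one application."""

    def __init__(self):
        self.final_status = None  # None means the AM never unregistered

    def unregister(self, status):
        self.final_status = status


def run_job(output_dir, rm, crash_after_commit):
    """Commit first, then unregister (the order the comment argues for)."""
    # Step 1: commit. The marker is visible to downstream pollers right away.
    with open(os.path.join(output_dir, SUCCESS_MARKER), "w"):
        pass
    # Step 2: the AM may die here, after commit but before telling the RM.
    if crash_after_commit:
        return
    # Step 3: only now does the RM learn that the job succeeded.
    rm.unregister("SUCCEEDED")


def downstream_sees_success(output_dir):
    """What a downstream poller would check (assumed behaviour)."""
    return os.path.exists(os.path.join(output_dir, SUCCESS_MARKER))


def simulate(crash_after_commit):
    """Return (what the poller sees, what the toy RM recorded)."""
    rm = ToyResourceManager()
    with tempfile.TemporaryDirectory() as out:
        run_job(out, rm, crash_after_commit)
        return downstream_sees_success(out), rm.final_status
```

In this toy model, simulate(False) returns (True, "SUCCEEDED"), while simulate(True) returns (True, None): the poller already sees success, but the RM has no final status recorded. Delaying the report to JobClient does not close that window, because the marker is published at commit time.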
                
> Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch,
YARN-540.5.patch, YARN-540.6.patch, YARN-540.patch, YARN-540.patch
>
>
> When a job succeeds and successfully calls finishApplicationMaster, and the RM then shuts down for a restart, the dispatcher
is stopped before it can process the REMOVE_APP event. The next time the RM comes back up, it reloads
the existing state files even though the job has succeeded.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
