hadoop-yarn-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-128) Resurrect RM Restart
Date Mon, 24 Sep 2012 21:12:09 GMT

    [ https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462141#comment-13462141 ]

Robert Joseph Evans commented on YARN-128:
------------------------------------------

bq. AMs should not finish themselves while the RM is down or recovering. They should just
spin.

+1 for that.  If we let the MR AM finish, and the RM then comes up and tries to restart it,
the restarted AM will get confused because it will not find the job history log where it
expects to see it, which will cause it to start the job over.  It is then likely to find the
output directory already populated with data, which could cause the job to fail.  What is
worse, it may not fail, because I think the output committer will ignore those errors.  The
first AM could also inform Oozie through a callback that the job finished, so a second job may
be launched and be reading the data at the time that the restarted first job is trying to
write that data, which could cause inconsistent results or cause the second job to fail
somewhat randomly.
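
To make the "just spin" idea concrete, here is a minimal sketch in Python (not the MR AM's actual Java code). It assumes a hypothetical `rm_is_reachable` probe and a hypothetical `unregister` callback; neither name comes from YARN, and the real AM would do this inside its RM client.

```python
import time
from typing import Callable


def finish_when_rm_is_back(
    rm_is_reachable: Callable[[], bool],   # hypothetical probe, not a YARN API
    unregister: Callable[[], None],        # hypothetical "tell the RM we are done" call
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Spin until the RM answers, then unregister.

    Returns the number of polls that found the RM unreachable.
    """
    failed_polls = 0
    # Do not finish while the RM is down or recovering: just wait and retry.
    while not rm_is_reachable():
        failed_polls += 1
        sleep(poll_interval_s)
    # Only now is it safe to report completion to the RM.
    unregister()
    return failed_polls
```

The point of the sketch is only the ordering: the AM never reports completion (or exits) until the RM is there to record it, so a recovered RM has no reason to relaunch a job that already finished.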

bq. An upper bound (time) on recovery?

This is a bit difficult to determine because the RM is responsible for renewing tokens.  Right
now it renews them when they have only about 10% of their lifetime left before they expire,
so the bound depends on how long the shortest-lived token you have in flight is valid for
before it needs to be renewed.  In general all of the tokens I have seen are valid for 24
hours, so you would have about 2.4 hours to bring the RM back up and read in/start renewing
all of the tokens or risk tokens expiring.
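
The arithmetic above can be sketched as follows. This is a back-of-the-envelope Python helper, not RM code; the function name and the 10% constant are illustrative assumptions taken from the description in this comment, and the result is a worst case (the RM goes down just as the shortest-lived token becomes due for renewal).

```python
from datetime import timedelta
from typing import Iterable

# Assumption from the comment above: renewal happens with ~10% of the lifetime left.
RENEW_FRACTION_LEFT = 0.10


def recovery_window(
    token_lifetimes: Iterable[timedelta],
    fraction_left: float = RENEW_FRACTION_LEFT,
) -> timedelta:
    """Worst-case time the RM can stay down before the shortest-lived token expires."""
    # The shortest-lived token in flight sets the bound.
    return min(token_lifetimes) * fraction_left


if __name__ == "__main__":
    # One 24-hour token and one 7-day token in flight: the 24-hour one dominates.
    print(recovery_window([timedelta(hours=24), timedelta(days=7)]))
```

With only 24-hour tokens in flight this works out to roughly 2.4 hours; a single shorter-lived token shrinks the window proportionally.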
                
> Resurrect RM Restart 
> ---------------------
>
>                 Key: YARN-128
>                 URL: https://issues.apache.org/jira/browse/YARN-128
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.0-alpha
>            Reporter: Arun C Murthy
>            Assignee: Bikas Saha
>         Attachments: MR-4343.1.patch, RM-recovery-initial-thoughts.txt
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM refactor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
