hadoop-yarn-issues mailing list archives

From "Neill Lima (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
Date Mon, 09 Feb 2015 09:39:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311993#comment-14311993 ]

Neill Lima commented on YARN-3152:
----------------------------------

Hello [~Naganarasimha], that seems to be a good idea. About the last scenario you described
(standby with a missing exclude file), I have two questions:

- Was the exclude file present during HA startup and then removed somehow? Otherwise
the HA startup should have failed, as it did in my case.
- During the transition from standby -> active, would the RM keep polling the filesystem
for the exclude file and resume the transition once the file is in place?

* A visual aid on the RM web UI would be quite handy as well, because anyone in Dev/Ops would
check the "<RM_IP>:8088/cluster" page to follow job startup and scheduling.
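
To make the polling question above concrete, here is a minimal sketch of what "keep polling until the file is in place" could look like. This is not YARN code: the class `ExcludeFileWaiter`, the method `waitForFile`, and the timeout/interval parameters are all hypothetical names chosen for illustration, and it uses only the JDK's `java.nio.file` API against a local path.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

/** Hypothetical helper for illustration only; not part of YARN. */
final class ExcludeFileWaiter {

    /**
     * Polls the local filesystem until the given file exists or the timeout elapses.
     *
     * @return true if the file was seen before the deadline, false otherwise
     */
    static boolean waitForFile(Path file, Duration timeout, Duration interval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (Files.exists(file)) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Sleep for one interval, but never past the deadline.
            long remainingMillis = Math.max(1, remainingNanos / 1_000_000);
            Thread.sleep(Math.min(interval.toMillis(), remainingMillis));
        }
    }
}
```

A caller could, for example, wait a bounded time for `/hadoop-2.6.0/etc/hadoop/exclude` before giving up on the transition; whether an unbounded wait is acceptable during a standby -> active transition is exactly the open question.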

> Missing hadoop exclude file fails RMs in HA
> -------------------------------------------
>
>                 Key: YARN-3152
>                 URL: https://issues.apache.org/jira/browse/YARN-3152
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>         Environment: Debian 7
>            Reporter: Neill Lima
>            Assignee: Naganarasimha G R
>
> I have two NNs in HA; they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude).
I had one RM and wanted to run two in HA. I had not created the exclude file at this point
either. I applied the HA RM settings properly, and when I started both RMs I started getting
this exception:
> 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=root	OPERATION=transitionToActive	TARGET=RMHAProtocolService	RESULT=FAILURE	DESCRIPTION=Exception
transitioning to active	PERMISSIONS=All users are allowed
> 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling
the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> 	at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
> 	at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805)
> 	at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active
mode
> 	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304)
> 	at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
> 	... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException:
/hadoop-2.6.0/etc/hadoop/exclude (No such file or directory)
> 	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626)
> 	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
> 	... 5 more
> 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish
ZK session
> 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094
closed
> 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection,
connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c
> 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection
to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to x.x.x.x/x.x.x.x:2181, initiating session
> The error is descriptive enough to resolve the problem, and I fixed it by creating
the exclude file.
> I just suggest, as an improvement:
> - Should RMs ignore the missing file as the NNs do?
> - Should a single RM fail even when the file is not present?
> Just suggesting this improvement to keep the behavior consistent when working in
HA (both NNs and RMs).
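
As a rough illustration of the "ignore the missing file" option raised in the description, the sketch below treats a missing exclude file as an empty exclusion list instead of failing. It is not the actual RM or NN implementation: `LenientExcludeReader` and `readExcludedHosts` are hypothetical names, and the file format assumed here (one host per line, `#` starting a comment line) is a simplification.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class LenientExcludeReader {

    /**
     * Reads excluded hosts from a local file.
     * Assumed format: one host per line; blank lines and lines starting with '#' are skipped.
     * A missing file yields an empty set; other I/O errors still propagate.
     */
    static Set<String> readExcludedHosts(Path excludeFile) throws IOException {
        Set<String> hosts = new LinkedHashSet<>();
        try {
            for (String line : Files.readAllLines(excludeFile, StandardCharsets.UTF_8)) {
                String host = line.trim();
                if (!host.isEmpty() && !host.startsWith("#")) {
                    hosts.add(host);
                }
            }
        } catch (NoSuchFileException e) {
            // Missing file: treat as "no hosts excluded" rather than a fatal error.
        }
        return hosts;
    }
}
```

The trade-off is that a mistyped path would silently exclude nothing, so a real change would probably want at least a warning in the log when the file is absent.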



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
