hadoop-hdfs-issues mailing list archives

From "Rakesh R (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3477) FormatZK and ZKFC startup can fail due to zkclient connection establishment delay
Date Tue, 20 Nov 2012 18:30:59 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501358#comment-13501358 ]

Rakesh R commented on HDFS-3477:

Sorry for the late reply. Yes, you are absolutely correct. I had missed the case of receiving
multiple events between the completion of the wait and the watcher registration.

I can also see a rare corner case:
The ZK connection event is received and then, immediately before the watcher registration, ZK
gets Disconnected. That event is delivered to the old watcher and is never processed.
In this case we would still be in SyncConnected and could end up in an inconsistent state. I
tried this scenario on my cluster but was not able to reproduce it; I finally reproduced it
by setting debug breakpoints. Your approach of queuing up the events is fine, and
it executes all the events sequentially without missing anything.
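
To illustrate the queuing idea discussed in this comment, here is a minimal sketch using only the JDK. It is not the HDFS-3477 patch and does not use the ZooKeeper API: the `State` enum, `QueueingWatcher`, and `replay` are hypothetical names standing in for the real connection events and watcher. The point it shows is that events buffered before the real handler is attached are replayed in arrival order, so a Disconnected event that arrives right after SyncConnected is not lost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueuedWatcherSketch {
    /** Stand-in for ZooKeeper connection states (hypothetical, not the ZooKeeper API). */
    enum State { SYNC_CONNECTED, DISCONNECTED, EXPIRED }

    /** Buffers every event it receives so none is dropped before a handler is attached. */
    static class QueueingWatcher {
        private final BlockingQueue<State> pending = new LinkedBlockingQueue<>();

        // Called by the (simulated) client event thread for each connection event.
        void process(State event) {
            pending.add(event);
        }

        // Removes and returns all buffered events in arrival order.
        List<State> drain() {
            List<State> out = new ArrayList<>();
            pending.drainTo(out);
            return out;
        }
    }

    /** Replays buffered events sequentially and returns the last state seen (null if none). */
    static State replay(QueueingWatcher watcher) {
        State last = null;
        for (State s : watcher.drain()) {
            last = s; // a real handler would act on each event here
        }
        return last;
    }

    public static void main(String[] args) {
        QueueingWatcher w = new QueueingWatcher();
        w.process(State.SYNC_CONNECTED);
        w.process(State.DISCONNECTED); // arrives before the real handler is registered
        System.out.println(replay(w)); // prints DISCONNECTED
    }
}
```

Because the events are replayed rather than discarded, the final state seen by the handler is Disconnected rather than a stale SyncConnected.
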
> FormatZK and ZKFC startup can fail due to zkclient connection establishment delay
> ---------------------------------------------------------------------------------
>                 Key: HDFS-3477
>                 URL: https://issues.apache.org/jira/browse/HDFS-3477
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: auto-failover
>    Affects Versions: 2.0.0-alpha
>            Reporter: suja s
>            Assignee: Rakesh R
>         Attachments: HDFS-3477.1.patch, HDFS-3477.2.patch, HDFS-3477.3.patch, HDFS-3477.3.patch, HDFS-3477.patch, hdfs-3477.txt, hdfs-3477.txt
> Format and ZKFC startup flows continue further after creation of zkclient connection
> without waiting to check whether the connection is completely established. This leads to
> failure at the subsequent point if connection was not complete by then.
> Exception trace for format 
> {noformat}
> 12/05/30 19:48:24 INFO zookeeper.ClientCnxn: Socket connection established to HOST-xx-xx-xx-55/xx.xx.xx.55:2182, initiating session
> 12/05/30 19:48:24 INFO zookeeper.ClientCnxn: Session establishment complete on server HOST-xx-xx-xx-55/xx.xx.xx.55:2182, sessionid = 0x1379da4660c0014, negotiated timeout = 5000
> 12/05/30 19:48:24 WARN ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x1379da4660c0014
> 12/05/30 19:48:24 INFO zookeeper.ZooKeeper: Session: 0x1379da4660c0014 closed
> 12/05/30 19:48:24 INFO zookeeper.ClientCnxn: EventThread shut down
> Exception in thread "main" java.io.IOException: Couldn't determine existence of znode
>         at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:263)
>         at org.apache.hadoop.ha.ZKFailoverController.formatZK(ZKFailoverController.java:257)
>         at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:195)
>         at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:58)
>         at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:163)
>         at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:159)
>         at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:438)
>         at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:159)
>         at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:171)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hadoop-ha/hacluster
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1021)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1049)
>         at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:261)
>         ... 8 more
> {noformat}
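
The failure in the issue description above comes from using the client before its connection is established. One common way to avoid that is to block on a latch until a connected event arrives, with a timeout. The sketch below uses only the JDK and is an assumption about the general technique, not the code from the attached patches; `ConnectionWaitSketch`, `onConnected`, and `awaitConnected` are hypothetical names, and the background thread merely simulates the client's event thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ConnectionWaitSketch {
    // Released once, when the first connected event is observed.
    private final CountDownLatch connected = new CountDownLatch(1);

    // Would be invoked from the client's event callback on a connected event.
    void onConnected() {
        connected.countDown();
    }

    // Blocks until connected or until the timeout elapses; returns true only if connected.
    boolean awaitConnected(long timeoutMs) {
        try {
            return connected.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        ConnectionWaitSketch s = new ConnectionWaitSketch();

        // Simulated event thread: the connected event arrives a little later.
        Thread eventThread = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            s.onConnected();
        });
        eventThread.start();

        if (!s.awaitConnected(5000)) {
            throw new java.io.IOException("Timed out waiting for connection");
        }
        System.out.println("connected; safe to check the parent znode now");
        eventThread.join();
    }
}
```

As the comment above notes, a single latch like this does not by itself cover events that arrive between the end of the wait and the registration of the real watcher, which is why the events also need to be queued.
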

