hadoop-common-issues mailing list archives

From "Lin Yiqun (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12680) Loss of zookeeper quorum lead all the namenode to be standby state
Date Fri, 25 Dec 2015 08:12:49 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071414#comment-15071414
] 

Lin Yiqun commented on HADOOP-12680:
------------------------------------

Here are excerpts from the zkfc logs of the two NameNodes in this case:
{code}
2015-12-24 17:33:43,873 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired. Entering
neutral mode and rejoining...
2015-12-24 17:33:43,873 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish
ZK session
2015-12-24 17:33:43,875 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection,
connectString=10.13.8.24:2181,10.13.8.25:2181,10.13.8.26:2181,10.13.8.27:2181,10.13.7.33:2181
sessionTimeout=30000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@56d70b02
2015-12-24 17:33:43,884 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server 10.13.8.25/10.13.8.25:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:43,884 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null,
unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-12-24 17:33:43,905 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper
service, session 0x451703dcdf7d107 has expired, closing socket connection
2015-12-24 17:33:43,985 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server 10.13.7.33/10.13.7.33:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:43,985 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to 10.13.7.33/10.13.7.33:2181, initiating session
2015-12-24 17:33:43,985 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data
from server sessionid 0x0, likely server has closed socket, closing socket connection and
attempting reconnect
2015-12-24 17:33:44,712 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server 10.13.8.24/10.13.8.24:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:44,712 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null,
unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-12-24 17:33:45,806 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server 10.13.8.26/10.13.8.26:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:45,807 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to 10.13.8.26/10.13.8.26:2181, initiating session
2015-12-24 17:33:45,807 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data
from server sessionid 0x0, likely server has closed socket, closing socket connection and
attempting reconnect
2015-12-24 17:33:46,549 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server 10.13.8.27/10.13.8.27:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:46,550 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to 10.13.8.27/10.13.8.27:2181, initiating session
2015-12-24 17:33:46,561 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete
on server 10.13.8.27/10.13.8.27:2181, sessionid = 0x451d35639b5002a, negotiated timeout =
30000
2015-12-24 17:33:46,563 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2015-12-24 17:33:46,564 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2015-12-24 17:33:46,573 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated
that NameNode at qihe2192/10.12.2.192:9000 should become standby
2015-12-24 17:33:46,575 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned
NameNode at qihe2192/10.12.2.192:9000 to standby state
2015-12-24 17:47:21,517 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception
trying to monitor health of NameNode at qihe2192/10.12.2.192:9000: java.io.IOException: Connection
reset by peer Failed on local exception: java.io.IOException: Connection reset by peer; Host
Details : local host is: "qihe2192/10.12.2.192"; destination host is: "qihe2192":9000;
{code}
{code}
2015-12-24 17:33:44,860 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper
service, session 0x551703eef8b00c2 has expired, closing socket connection
2015-12-24 17:33:44,861 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired. Entering
neutral mode and rejoining...
2015-12-24 17:33:44,861 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish
ZK session
2015-12-24 17:33:44,862 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection,
connectString=10.13.8.24:2181,10.13.8.25:2181,10.13.8.26:2181,10.13.8.27:2181,10.13.7.33:2181
sessionTimeout=30000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5eefe70b
2015-12-24 17:33:44,863 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server 10.13.8.27/10.13.8.27:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:44,863 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to 10.13.8.27/10.13.8.27:2181, initiating session
2015-12-24 17:33:44,871 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete
on server 10.13.8.27/10.13.8.27:2181, sessionid = 0x451d35639b50012, negotiated timeout =
30000
2015-12-24 17:33:44,873 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2015-12-24 17:33:44,874 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2015-12-24 17:33:44,892 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated
that NameNode at qihe2182/10.12.2.182:9000 should become standby
2015-12-24 17:33:44,928 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned
NameNode at qihe2182/10.12.2.182:9000 to standby state
2015-12-24 17:47:20,883 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception
trying to monitor health of NameNode at qihe2182/10.12.2.182:9000: java.io.IOException: Connection
reset by peer Failed on local exception: java.io.IOException: Connection reset by peer; Host
Details : local host is: "qihe2182/10.12.2.182"; destination host is: "qihe2182":9000;
2015-12-24 17:47:20,883 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
{code}
At {{2015-12-24 17:33}}, both NameNodes were transitioned to standby state.

> Loss of zookeeper quorum lead all the namenode to be standby state
> ------------------------------------------------------------------
>
>                 Key: HADOOP-12680
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12680
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.7.1
>            Reporter: Lin Yiqun
>
> While upgrading my ZooKeeper cluster, which involved changing the IP addresses of the ZK nodes,
> I found that the two NameNodes of my Hadoop cluster lost their connection to ZK. When I
> recovered the ZK cluster, both NameNodes were transitioned to standby state, which left the
> cluster unable to provide service. I think the reason may be the following:
> {code}
> /**
>      * If the elector gets disconnected from Zookeeper and does not know about
>      * the lock state, then it will notify the service via the enterNeutralMode
>      * interface. The service may choose to ignore this or stop doing state
>      * changing operations. Upon reconnection, the elector verifies the leader
>      * status and calls back on the becomeActive and becomeStandby app
>      * interfaces. <br/>
>      * Zookeeper disconnects can happen due to network issues or loss of
>      * Zookeeper quorum. Thus enterNeutralMode can be used to guard against
>      * split-brain issues. In such situations it might be prudent to call
>      * becomeStandby too. However, such state change operations might be
>      * expensive and enterNeutralMode can help guard against doing that for
>      * transient issues.
>      */
>     void enterNeutralMode();
> {code}
> Maybe we should create a thread that monitors the state of the NameNodes and prevents them
> from all being in standby state.
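
A minimal sketch of the monitoring idea suggested above, not an actual Hadoop change. The names {{AllStandbyDetector}}, {{State}} and {{graceMillis}} are hypothetical and are not part of the Hadoop API. It only shows the detection logic: report a problem when every NameNode has been observed in standby for longer than a grace period, so that a short failover window is not flagged. How the states are collected and what is done once the condition is reported are left open.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Hadoop: detects a persistent "all standby" condition. */
final class AllStandbyDetector {
    /** Simplified view of an HA state; UNKNOWN means the NameNode could not be queried. */
    enum State { ACTIVE, STANDBY, UNKNOWN }

    /** True only if at least one state is given and every state is STANDBY. */
    static boolean allStandby(List<State> states) {
        if (states.isEmpty()) {
            return false;
        }
        for (State s : states) {
            // UNKNOWN is deliberately not counted as standby: an unreachable
            // NameNode may still be active, so we stay conservative.
            if (s != State.STANDBY) {
                return false;
            }
        }
        return true;
    }

    private final long graceMillis;
    private long allStandbySince = -1; // -1 means "not currently all standby"

    AllStandbyDetector(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    /**
     * Records one observation taken at nowMillis. Returns true once the
     * all-standby condition has lasted at least graceMillis.
     */
    boolean observe(List<State> states, long nowMillis) {
        if (!allStandby(states)) {
            allStandbySince = -1;
            return false;
        }
        if (allStandbySince < 0) {
            allStandbySince = nowMillis;
        }
        return nowMillis - allStandbySince >= graceMillis;
    }

    public static void main(String[] args) {
        AllStandbyDetector detector = new AllStandbyDetector(60_000L);
        List<State> states = Arrays.asList(State.STANDBY, State.STANDBY);
        System.out.println(detector.observe(states, 0L));
        System.out.println(detector.observe(states, 60_000L));
    }
}
```

A monitoring thread could call {{observe}} periodically with states gathered from each NameNode, for example from the output of {{hdfs haadmin -getServiceState}}. Whether such a thread should then trigger a new election or only raise an alert is a separate design question, since forcing a NameNode to active without a ZooKeeper lock would bypass the split-brain protection described in the Javadoc above.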



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
