zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tao Xiao <xiaotao.cs....@gmail.com>
Subject Re: zookeeper works well but log reports "Cannot open channel to 2 at election address ... java.net.ConnectException: Connection refused"
Date Fri, 14 Feb 2014 09:55:37 GMT
Hi German,

Thanks for your reply.

I checked logs on all the three nodes for the two clusters(the *good
cluster* where  I can query the status through "bin/zkServer.sh status" and
the* bad cluster* where executing the same command would report "Error
contacting service. It is probably not running.") respectively.


*The 1st and 2nd nods of both clusters have the following warnings in their
logs:*
.... ....
2014-02-11 15:42:15,421 [myid:2] - WARN
 [WorkerSender[myid=2]:QuorumCnxManager@368] - Cannot open channel to 3 at
election address imon-3/172.16.38.145:3888
java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:529)
    at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
    at
org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
    at
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
    at
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
    at java.lang.Thread.run(Thread.java:662)
.... ....
.... ....
2014-02-11 17:37:43,141 [myid:2] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x2441fe57a7e000e, likely client has closed socket
    at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
    at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:662)
.... ....


*The 3rd node of the good cluster has only one kind of warnings as follows:*
... ...
2014-02-13 17:02:36,005 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to establish
new session at /172.16.16.170:2345
2014-02-13 17:02:36,038 [myid:3] - INFO
 [CommitProcessor:3:ZooKeeperServer@595] - Established session
0x3440a225c00002d with negotiated timeout 40000 for client /
172.16.16.170:2345
2014-02-14 04:20:19,622 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x3440a225c000026, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:662)
2014-02-14 04:20:19,623 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
client /172.16.66.134:55513 which had sessionid 0x3440a225c000026
2014-02-14 10:04:02,661 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of
session 0x3440a225c00002d due to java.io.IOException: Connection reset by
peer
2014-02-14 10:04:02,662 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
client /172.16.16.170:2345 which had sessionid 0x3440a225c00002d
2014-02-14 16:52:16,837 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection
from /127.0.0.1:42057
2014-02-14 16:52:16,874 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@821] - Processing stat command from /
127.0.0.1:42057
2014-02-14 16:52:16,896 [myid:3] - INFO
 [Thread-3:NIOServerCnxn$StatCommand@655] - Stat command output
2014-02-14 16:52:16,897 [myid:3] - INFO  [Thread-3:NIOServerCnxn@1001] -
Closed socket connection for client /127.0.0.1:42057 (no session
established for client)
... ...


*The 3rd node of the bad cluster has no warnings at all.*

I checked the status for each of the servers, for both clusters
respectively. Executing the command "bin/zkServer.sh status" reported "Error
contacting service. It is probably not running." on each of the node of the
bad cluster, whereas it reported "mode: follower/leader" on each node of
the good cluster.


I executed the command "echo stat | nc 127.0.0.1 2181" on each node of the
two clusters. All nodes printed the information similar to the following:
Zookeeper version: 3.4.5-1392090, built on 09/30/2012 17:52 GMT
Clients:
 /172.16.38.145:35045[1](queued=0,recved=19750,sent=19750)
 /127.0.0.1:54713[0](queued=0,recved=1,sent=0)
 /172.16.38.144:35483[1](queued=0,recved=19788,sent=19797)
 /172.16.38.144:47445[1](queued=0,recved=1861,sent=1861)
Latency min/avg/max: 0/0/68
Received: 42751
Sent: 42882
Connections: 4
Outstanding: 0
Zxid: 0x20000013d
Mode: follower(/leader)
Node count: 27



Besides, I looked for lines containing "TOOK" in the logs, and found that
the log file on each node of the two clusters has only one line containing
"TOOK". For all the nodes, there are other logging info following the line
containing "TOOK", and I'm not sure how to confirm whether the leader
election was successful.
2014-02-07 10:17:08,311 [myid:2] - INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@162] - Created
server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
datadir /storage/disk1/zookeeper/data.dir/version-2 snapdir
/storage/disk1/zookeeper/data.dir/version-2
2014-02-07 10:17:08,312 [myid:2] - INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@345] - LEADING - LEADER
ELECTION TOOK - 610
2014-02-07 10:17:08,320 [myid:2] - INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot
/storage/disk1/zookeeper/data.dir/version-2/snapshot.1f00053d5d
2014-02-07 10:17:08,749 [myid:2] - INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240] -
Snapshotting: 0x1f00063340 to
/storage/disk1/zookeeper/data.dir/version-2/snapshot.1f00063340
2014-02-07 10:17:09,222 [myid:2] - INFO
 [LearnerHandler-/172.16.66.129:39195:LearnerHandler@263] - Follower sid: 1
: info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@5ddc0e7a
2014-02-07 10:17:09,347 [myid:2] - INFO
 [LearnerHandler-/172.16.66.129:39195:LearnerHandler@318] - Synchronizing
with Follower sid: 1 maxCommittedLog=0x1f00063340
minCommittedLog=0x1f0006314c peerLastZxid=0x1f00063340
2014-02-07 10:17:09,348 [myid:2] - INFO
 [LearnerHandler-/172.16.66.129:39195:LearnerHandler@395] - Sending DIFF
2014-02-07 10:17:09,396 [myid:2] - WARN
 [LearnerHandler-/172.16.66.129:39195:Leader@574] - Commiting zxid
0x2000000000 from /172.16.66.130:2888 not first!
2014-02-07 10:17:09,397 [myid:2] - WARN
 [LearnerHandler-/172.16.66.129:39195:Leader@576] - First is 0x0


Hope the info above can help you find the reason. Thanks!


Tao


2014-02-13 18:05 GMT+08:00 German Blanco <german.blanco.blanco@gmail.com>:

> I can´t explain it with this information.
> It could be e.g. that in the first cluster, you have a working ensemble of
> two of the nodes and the "bin/zkServer.sh status" is trying to connect to
> the one that doesn´t work.
> I would start by checking the status of each of the servers (not just one).
> Try out the 4 letter words commands:
> https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_zkCommands
> They give a lot of information.
> Also, look in the server logs for the last successful leader election (try
> grep for "TOOK" to do this) and the last election started (try grep for
> "New election"). If the last thing in the log is a "TOOK" then things
> should be fine, if it is a "New election" then there might be a problem.
> Then please post what you find out.
>
> Regards,
>
> German.
>
>
> On Thu, Feb 13, 2014 at 7:18 AM, Tao Xiao <xiaotao.cs.nju@gmail.com>
> wrote:
>
> > Hi German,
> >
> > I installed ZooKeeper in another cluster and it works well. I mean I can
> > query the status through "bin/zkServer.sh status", how to explain that?
> >
> >
> > 2014-02-11 18:12 GMT+08:00 German Blanco <german.blanco.blanco@gmail.com
> >:
> >
> > > Hello,
> > >
> > > It doesn't matter.
> > > The reason is that each zookeeper servers upon start-up try to
> establish
> > > connections with all other servers in its configuration.
> > > However, only one connection for pair should be there, so half of the
> > > connections are dropped.
> > > Right after connection servers send their ids. If the id of the peer
> > > initiating the connection is higher than the peer receiving the
> > connection,
> > > then everything proceeds, otherwise connection is rejected.
> > > Why do peers with a lower id try to open connections with the peers
> with
> > a
> > > higher id then?
> > > Because in that way, they trigger a connection attempt in the other
> > > direction. That is, say servers 2 and 3 have a working ensemble, they
> are
> > > not going to attempt to connect to server 1 unless something happens.
> > > Server 1 wakes up, it attempts to connect to servers 2 and 3 and these
> > two
> > > connection attempts fail, but each triggers a connection attempt from
> the
> > > respective peer that succeeds.
> > > This was a bit more than you asked for, but anyway I hope it helps :-).
> > If
> > > the answer doesn't work for you, please let me know.
> > >
> > > Best regards,
> > >
> > > German.
> > >
> > >
> > > On Tue, Feb 11, 2014 at 10:53 AM, Tao Xiao <xiaotao.cs.nju@gmail.com>
> > > wrote:
> > >
> > > > I installed zookeeper 3.4.5 in a 3-node cluster and started it. I
> think
> > > > zookeeper works well because the HBase cluster, which relies on
> > > zookeeper,
> > > > indeed works well. But when I tried to query zookeeper's status, it
> > > > reported:
> > > >
> > > > [root@imon-1 zookeeper-3.4.5]# bin/zkServer.sh status
> > > > JMX enabled by default
> > > > Using config: /usr/local/apache/zookeeper-3.4.5/bin/../conf/zoo.cfg
> > > > Error contacting service. It is probably not running.
> > > >
> > > >
> > > > I checked the log and found it reported the following:
> > > >
> > > > 2014-02-11 15:42:15,623 [myid:] - INFO  [main:QuorumPeerConfig@101]
> -
> > > > Reading configuration from:
> > > > /usr/local/apache/zookeeper-3.4.5/bin/../conf/zoo.cfg
> > > > 2014-02-11 15:42:15,629 [myid:] - INFO  [main:QuorumPeerConfig@334]
> -
> > > > Defaulting to majority quorums
> > > > 2014-02-11 15:42:15,652 [myid:1] - INFO
>  [main:DatadirCleanupManager@78
> > ]
> > > -
> > > > autopurge.snapRetainCount set to 3
> > > > 2014-02-11 15:42:15,652 [myid:1] - INFO
>  [main:DatadirCleanupManager@79
> > ]
> > > -
> > > > autopurge.purgeInterval set to 0
> > > > 2014-02-11 15:42:15,653 [myid:1] - INFO
> >  [main:DatadirCleanupManager@101
> > > ]
> > > > -
> > > > Purge task is not scheduled.
> > > > 2014-02-11 15:42:15,683 [myid:1] - INFO  [main:QuorumPeerMain@127] -
> > > > Starting quorum peer
> > > > 2014-02-11 15:42:15,726 [myid:1] - INFO
>  [main:NIOServerCnxnFactory@94
> > ]
> > > -
> > > > binding to port 0.0.0.0/0.0.0.0:2181
> > > > 2014-02-11 15:42:15,759 [myid:1] - INFO  [main:QuorumPeer@913] -
> > > tickTime
> > > > set to 2000
> > > > 2014-02-11 15:42:15,760 [myid:1] - INFO  [main:QuorumPeer@933] -
> > > > minSessionTimeout set to -1
> > > > 2014-02-11 15:42:15,760 [myid:1] - INFO  [main:QuorumPeer@944] -
> > > > maxSessionTimeout set to -1
> > > > 2014-02-11 15:42:15,760 [myid:1] - INFO  [main:QuorumPeer@959] -
> > > initLimit
> > > > set to 10
> > > > 2014-02-11 15:42:15,874 [myid:1] - INFO  [main:FileSnap@83] -
> Reading
> > > > snapshot /var/data/zk/dataDir/version-2/snapshot.100072a68
> > > > 2014-02-11 15:42:18,211 [myid:1] - INFO
> > > >  [Thread-1:QuorumCnxManager$Listener@486] - My election bind port:
> > > > 0.0.0.0/0.0.0.0:3888
> > > > 2014-02-11 15:42:18,226 [myid:1] - INFO  [QuorumPeer[myid=1]/
> > > 0.0.0.0:2181
> > > > :QuorumPeer@670] - LOOKING
> > > > 2014-02-11 15:42:18,232 [myid:1] - INFO  [QuorumPeer[myid=1]/
> > > 0.0.0.0:2181
> > > > :FastLeaderElection@740] - New election. My id =  1, proposed
> > > > zxid=0x1000735ac
> > > > 2014-02-11 15:42:18,234 [myid:1] - INFO
> > > >  [WorkerReceiver[myid=1]:FastLeaderElection@542] - Notification: 1
> > > > (n.leader), 0x1000735ac (n.zxid), 0x1 (n.round), LOOKING (n.state), 1
> > > > (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)
> > > > 2014-02-11 15:42:18,256 [myid:1] - WARN
> > > >  [WorkerSender[myid=1]:QuorumCnxManager@368] - Cannot open channel
> to
> > 2
> > > at
> > > > election address imon-2/172.16.38.144:3888
> > > > java.net.ConnectException: Connection refused
> > > >         at java.net.PlainSocketImpl.socketConnect(Native Method)
> > > >         at
> java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
> > > >         at
> > > > java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
> > > >         at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
> > > >         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> > > >         at java.net.Socket.connect(Socket.java:529)
> > > >         at
> > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
> > > >         at
> > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
> > > >         at
> > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
> > > >         at
> > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
> > > >         at java.lang.Thread.run(Thread.java:662)
> > > > 2014-02-11 15:42:18,268 [myid:1] - WARN
> > > >  [WorkerSender[myid=1]:QuorumCnxManager@368] - Cannot open channel
> to
> > 3
> > > at
> > > > election address imon-3/172.16.38.145:3888
> > > > java.net.ConnectException: Connection refused
> > > >         at java.net.PlainSocketImpl.socketConnect(Native Method)
> > > >         at
> java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
> > > >         at
> > > > java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
> > > >         at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
> > > >         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> > > >         at java.net.Socket.connect(Socket.java:529)
> > > >         at
> > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
> > > >         at
> > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
> > > >         at
> > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
> > > >         at
> > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
> > > >         at java.lang.Thread.run(Thread.java:662)
> > > > 2014-02-11 15:42:18,436 [myid:1] - WARN  [QuorumPeer[myid=1]/
> > > 0.0.0.0:2181
> > > > :QuorumCnxManager@368] - Cannot open channel to 2 at election
> address
> > > > imon-2/172.16.38.144:3888
> > > > java.net.ConnectException: Connection refused
> > > >
> > > > ... ...
> > > >
> > > >
> > > > Do these warnings matter?  What is the reason.
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message