zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tao Xiao <xiaotao.cs....@gmail.com>
Subject Re: zookeeper works well but log reports "Cannot open channel to 2 at election address ... java.net.ConnectException: Connection refused"
Date Sat, 15 Feb 2014 01:59:33 GMT
Hi German,

I confirmed that the command "echo stat | nc 127.0.0.1 2181" reported "Mode:
follower(/leader)" for each node in the bad cluster,  but the command
"bin/zkServer.sh
status" doesn't work.

For both the good and bad clusters, the line containing "Have quorum of
supporters" only appears in the log file of  2nd node(the leader for both
clusters respectively).

I can see something like "INFO  [main:NIOServerCnxnFactory@94] - binding to
port 0.0.0.0/0.0.0.0:2181" for each node of the two clusters.

The command "ping localhost" reported "PING localhost (202.102.110.203)
56(84) bytes of data." in the bad cluster, indicating that the DNS
resolution of localhost is not correct. I checked the file /etc/hosts on
each node of the bad cluster and found they all lack the line "127.0.0.1
localhost.localcertus.com  localhost". In the good cluster, all files
/etc/hosts do contain that line and the command  "ping localhost" correctly
reported "PING localhost.localcertus.com (127.0.0.1) 56(84) bytes of data."

*I suppose that is the very reason why the command "bin/zkServer.sh
status" doesn't work in the bad cluster. *

So I added that line to /etc/hosts and the command "ping localhost" correctly
reported "PING localhost.localcertus.com (127.0.0.1) 56(84) bytes of data."


Now the command "bin/zkServer.sh status" works on each node.

Thanks German, thank you for your help.


Best regards,
Tao.



2014-02-14 20:07 GMT+08:00 German Blanco <german.blanco.blanco@gmail.com>:

> Hello Tao,
>
> For the reason of the logs "Cannot open channel to ..." see my first reply.
> I believe they are not relevant for your problem, so let's forget about
> them for now.
>
> You are saying in your mail that command "echo stat | nc 127.0.0.1 2181"
> reports "Mode: follower(/leader)" also in the *bad cluster*?
> That is very strange. If that is the case, then "bin/zkServer.sh status"
> should also work ok. Please confirm.
>
> If you found the line TOOK in the bad cluster, that means leader election
> was successful.
>
> Next thing that should be there (most likely a few lines below) is this
> (only in the leader):
> "Have quorum of supporters ..."
>
> Do you see something like this?
> "[main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181"
> (grep "binding to port")
>
> Could you also please check that you have a normal DNS resolution of
> localhost in the bad cluster?
>
> Regards,
>
> German.
>
>
> On Fri, Feb 14, 2014 at 10:55 AM, Tao Xiao <xiaotao.cs.nju@gmail.com>
> wrote:
>
> > Hi German,
> >
> > Thanks for your reply.
> >
> > I checked logs on all the three nodes for the two clusters(the *good
> > cluster* where  I can query the status through "bin/zkServer.sh status"
> and
> > the* bad cluster* where executing the same command would report "Error
> > contacting service. It is probably not running.") respectively.
> >
> >
> > *The 1st and 2nd nods of both clusters have the following warnings in
> their
> > logs:*
> > .... ....
> > 2014-02-11 15:42:15,421 [myid:2] - WARN
> >  [WorkerSender[myid=2]:QuorumCnxManager@368] - Cannot open channel to 3
> at
> > election address imon-3/172.16.38.145:3888
> > java.net.ConnectException: Connection refused
> >     at java.net.PlainSocketImpl.socketConnect(Native Method)
> >     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
> >     at
> java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
> >     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
> >     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> >     at java.net.Socket.connect(Socket.java:529)
> >     at
> >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
> >     at
> >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
> >     at
> >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
> >     at
> >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
> >     at java.lang.Thread.run(Thread.java:662)
> > .... ....
> > .... ....
> > 2014-02-11 17:37:43,141 [myid:2] - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
> > EndOfStreamException: Unable to read additional data from client
> sessionid
> > 0x2441fe57a7e000e, likely client has closed socket
> >     at
> > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> >     at
> >
> >
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> >     at java.lang.Thread.run(Thread.java:662)
> > .... ....
> >
> >
> > *The 3rd node of the good cluster has only one kind of warnings as
> > follows:*
> > ... ...
> > 2014-02-13 17:02:36,005 [myid:3] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to
> establish
> > new session at /172.16.16.170:2345
> > 2014-02-13 17:02:36,038 [myid:3] - INFO
> >  [CommitProcessor:3:ZooKeeperServer@595] - Established session
> > 0x3440a225c00002d with negotiated timeout 40000 for client /
> > 172.16.16.170:2345
> > 2014-02-14 04:20:19,622 [myid:3] - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
> > EndOfStreamException: Unable to read additional data from client
> sessionid
> > 0x3440a225c000026, likely client has closed socket
> >         at
> > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> >         at
> >
> >
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> >         at java.lang.Thread.run(Thread.java:662)
> > 2014-02-14 04:20:19,623 [myid:3] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> > client /172.16.66.134:55513 which had sessionid 0x3440a225c000026
> > 2014-02-14 10:04:02,661 [myid:3] - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of
> > session 0x3440a225c00002d due to java.io.IOException: Connection reset by
> > peer
> > 2014-02-14 10:04:02,662 [myid:3] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> > client /172.16.16.170:2345 which had sessionid 0x3440a225c00002d
> > 2014-02-14 16:52:16,837 [myid:3] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
> > connection
> > from /127.0.0.1:42057
> > 2014-02-14 16:52:16,874 [myid:3] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@821] - Processing stat command from /
> > 127.0.0.1:42057
> > 2014-02-14 16:52:16,896 [myid:3] - INFO
> >  [Thread-3:NIOServerCnxn$StatCommand@655] - Stat command output
> > 2014-02-14 16:52:16,897 [myid:3] - INFO  [Thread-3:NIOServerCnxn@1001] -
> > Closed socket connection for client /127.0.0.1:42057 (no session
> > established for client)
> > ... ...
> >
> >
> > *The 3rd node of the bad cluster has no warnings at all.*
> >
> > I checked the status for each of the servers, for both clusters
> > respectively. Executing the command "bin/zkServer.sh status" reported
> > "Error
> > contacting service. It is probably not running." on each of the node of
> the
> > bad cluster, whereas it reported "mode: follower/leader" on each node of
> > the good cluster.
> >
> >
> > I executed the command "echo stat | nc 127.0.0.1 2181" on each node of
> the
> > two clusters. All nodes printed the information similar to the following:
> > Zookeeper version: 3.4.5-1392090, built on 09/30/2012 17:52 GMT
> > Clients:
> >  /172.16.38.145:35045[1](queued=0,recved=19750,sent=19750)
> >  /127.0.0.1:54713[0](queued=0,recved=1,sent=0)
> >  /172.16.38.144:35483[1](queued=0,recved=19788,sent=19797)
> >  /172.16.38.144:47445[1](queued=0,recved=1861,sent=1861)
> > Latency min/avg/max: 0/0/68
> > Received: 42751
> > Sent: 42882
> > Connections: 4
> > Outstanding: 0
> > Zxid: 0x20000013d
> > Mode: follower(/leader)
> > Node count: 27
> >
> >
> >
> > Besides, I looked for lines containing "TOOK" in the logs, and found that
> > the log file on each node of the two clusters has only one line
> containing
> > "TOOK". For all the nodes, there are other logging info following the
> line
> > containing "TOOK", and I'm not sure how to confirm whether the leader
> > election was successful.
> > 2014-02-07 10:17:08,311 [myid:2] - INFO
> >  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@162] - Created
> > server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
> > datadir /storage/disk1/zookeeper/data.dir/version-2 snapdir
> > /storage/disk1/zookeeper/data.dir/version-2
> > 2014-02-07 10:17:08,312 [myid:2] - INFO
> >  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@345] - LEADING - LEADER
> > ELECTION TOOK - 610
> > 2014-02-07 10:17:08,320 [myid:2] - INFO
> >  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading
> snapshot
> > /storage/disk1/zookeeper/data.dir/version-2/snapshot.1f00053d5d
> > 2014-02-07 10:17:08,749 [myid:2] - INFO
> >  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240] -
> > Snapshotting: 0x1f00063340 to
> > /storage/disk1/zookeeper/data.dir/version-2/snapshot.1f00063340
> > 2014-02-07 10:17:09,222 [myid:2] - INFO
> >  [LearnerHandler-/172.16.66.129:39195:LearnerHandler@263] - Follower
> sid:
> > 1
> > : info :
> > org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@5ddc0e7a
> > 2014-02-07 10:17:09,347 [myid:2] - INFO
> >  [LearnerHandler-/172.16.66.129:39195:LearnerHandler@318] -
> Synchronizing
> > with Follower sid: 1 maxCommittedLog=0x1f00063340
> > minCommittedLog=0x1f0006314c peerLastZxid=0x1f00063340
> > 2014-02-07 10:17:09,348 [myid:2] - INFO
> >  [LearnerHandler-/172.16.66.129:39195:LearnerHandler@395] - Sending DIFF
> > 2014-02-07 10:17:09,396 [myid:2] - WARN
> >  [LearnerHandler-/172.16.66.129:39195:Leader@574] - Commiting zxid
> > 0x2000000000 from /172.16.66.130:2888 not first!
> > 2014-02-07 10:17:09,397 [myid:2] - WARN
> >  [LearnerHandler-/172.16.66.129:39195:Leader@576] - First is 0x0
> >
> >
> > Hope the info above can help you find the reason. Thanks!
> >
> >
> > Tao
> >
> >
> > 2014-02-13 18:05 GMT+08:00 German Blanco <german.blanco.blanco@gmail.com
> >:
> >
> > > I can´t explain it with this information.
> > > It could be e.g. that in the first cluster, you have a working ensemble
> > of
> > > two of the nodes and the "bin/zkServer.sh status" is trying to connect
> to
> > > the one that doesn´t work.
> > > I would start by checking the status of each of the servers (not just
> > one).
> > > Try out the 4 letter words commands:
> > >
> https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_zkCommands
> > > They give a lot of information.
> > > Also, look in the server logs for the last successful leader election
> > (try
> > > grep for "TOOK" to do this) and the last election started (try grep for
> > > "New election"). If the last thing in the log is a "TOOK" then things
> > > should be fine, if it is a "New election" then there might be a
> problem.
> > > Then please post what you find out.
> > >
> > > Regards,
> > >
> > > German.
> > >
> > >
> > > On Thu, Feb 13, 2014 at 7:18 AM, Tao Xiao <xiaotao.cs.nju@gmail.com>
> > > wrote:
> > >
> > > > Hi German,
> > > >
> > > > I installed ZooKeeper in another cluster and it works well. I mean I
> > can
> > > > query the status through "bin/zkServer.sh status", how to explain
> that?
> > > >
> > > >
> > > > 2014-02-11 18:12 GMT+08:00 German Blanco <
> > german.blanco.blanco@gmail.com
> > > >:
> > > >
> > > > > Hello,
> > > > >
> > > > > It doesn't matter.
> > > > > The reason is that each zookeeper servers upon start-up try to
> > > establish
> > > > > connections with all other servers in its configuration.
> > > > > However, only one connection for pair should be there, so half of
> the
> > > > > connections are dropped.
> > > > > Right after connection servers send their ids. If the id of the
> peer
> > > > > initiating the connection is higher than the peer receiving the
> > > > connection,
> > > > > then everything proceeds, otherwise connection is rejected.
> > > > > Why do peers with a lower id try to open connections with the peers
> > > with
> > > > a
> > > > > higher id then?
> > > > > Because in that way, they trigger a connection attempt in the other
> > > > > direction. That is, say servers 2 and 3 have a working ensemble,
> they
> > > are
> > > > > not going to attempt to connect to server 1 unless something
> happens.
> > > > > Server 1 wakes up, it attempts to connect to servers 2 and 3 and
> > these
> > > > two
> > > > > connection attempts fail, but each triggers a connection attempt
> from
> > > the
> > > > > respective peer that succeeds.
> > > > > This was a bit more than you asked for, but anyway I hope it helps
> > :-).
> > > > If
> > > > > the answer doesn't work for you, please let me know.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > German.
> > > > >
> > > > >
> > > > > On Tue, Feb 11, 2014 at 10:53 AM, Tao Xiao <
> xiaotao.cs.nju@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > I installed zookeeper 3.4.5 in a 3-node cluster and started
it. I
> > > think
> > > > > > zookeeper works well because the HBase cluster, which relies
on
> > > > > zookeeper,
> > > > > > indeed works well. But when I tried to query zookeeper's status,
> it
> > > > > > reported:
> > > > > >
> > > > > > [root@imon-1 zookeeper-3.4.5]# bin/zkServer.sh status
> > > > > > JMX enabled by default
> > > > > > Using config:
> /usr/local/apache/zookeeper-3.4.5/bin/../conf/zoo.cfg
> > > > > > Error contacting service. It is probably not running.
> > > > > >
> > > > > >
> > > > > > I checked the log and found it reported the following:
> > > > > >
> > > > > > 2014-02-11 15:42:15,623 [myid:] - INFO
>  [main:QuorumPeerConfig@101
> > ]
> > > -
> > > > > > Reading configuration from:
> > > > > > /usr/local/apache/zookeeper-3.4.5/bin/../conf/zoo.cfg
> > > > > > 2014-02-11 15:42:15,629 [myid:] - INFO
>  [main:QuorumPeerConfig@334
> > ]
> > > -
> > > > > > Defaulting to majority quorums
> > > > > > 2014-02-11 15:42:15,652 [myid:1] - INFO
> > >  [main:DatadirCleanupManager@78
> > > > ]
> > > > > -
> > > > > > autopurge.snapRetainCount set to 3
> > > > > > 2014-02-11 15:42:15,652 [myid:1] - INFO
> > >  [main:DatadirCleanupManager@79
> > > > ]
> > > > > -
> > > > > > autopurge.purgeInterval set to 0
> > > > > > 2014-02-11 15:42:15,653 [myid:1] - INFO
> > > >  [main:DatadirCleanupManager@101
> > > > > ]
> > > > > > -
> > > > > > Purge task is not scheduled.
> > > > > > 2014-02-11 15:42:15,683 [myid:1] - INFO  [main:QuorumPeerMain@127
> ]
> > -
> > > > > > Starting quorum peer
> > > > > > 2014-02-11 15:42:15,726 [myid:1] - INFO
> > >  [main:NIOServerCnxnFactory@94
> > > > ]
> > > > > -
> > > > > > binding to port 0.0.0.0/0.0.0.0:2181
> > > > > > 2014-02-11 15:42:15,759 [myid:1] - INFO  [main:QuorumPeer@913]
-
> > > > > tickTime
> > > > > > set to 2000
> > > > > > 2014-02-11 15:42:15,760 [myid:1] - INFO  [main:QuorumPeer@933]
-
> > > > > > minSessionTimeout set to -1
> > > > > > 2014-02-11 15:42:15,760 [myid:1] - INFO  [main:QuorumPeer@944]
-
> > > > > > maxSessionTimeout set to -1
> > > > > > 2014-02-11 15:42:15,760 [myid:1] - INFO  [main:QuorumPeer@959]
-
> > > > > initLimit
> > > > > > set to 10
> > > > > > 2014-02-11 15:42:15,874 [myid:1] - INFO  [main:FileSnap@83]
-
> > > Reading
> > > > > > snapshot /var/data/zk/dataDir/version-2/snapshot.100072a68
> > > > > > 2014-02-11 15:42:18,211 [myid:1] - INFO
> > > > > >  [Thread-1:QuorumCnxManager$Listener@486] - My election bind
> port:
> > > > > > 0.0.0.0/0.0.0.0:3888
> > > > > > 2014-02-11 15:42:18,226 [myid:1] - INFO  [QuorumPeer[myid=1]/
> > > > > 0.0.0.0:2181
> > > > > > :QuorumPeer@670] - LOOKING
> > > > > > 2014-02-11 15:42:18,232 [myid:1] - INFO  [QuorumPeer[myid=1]/
> > > > > 0.0.0.0:2181
> > > > > > :FastLeaderElection@740] - New election. My id =  1, proposed
> > > > > > zxid=0x1000735ac
> > > > > > 2014-02-11 15:42:18,234 [myid:1] - INFO
> > > > > >  [WorkerReceiver[myid=1]:FastLeaderElection@542] -
> Notification: 1
> > > > > > (n.leader), 0x1000735ac (n.zxid), 0x1 (n.round), LOOKING
> > (n.state), 1
> > > > > > (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)
> > > > > > 2014-02-11 15:42:18,256 [myid:1] - WARN
> > > > > >  [WorkerSender[myid=1]:QuorumCnxManager@368] - Cannot open
> channel
> > > to
> > > > 2
> > > > > at
> > > > > > election address imon-2/172.16.38.144:3888
> > > > > > java.net.ConnectException: Connection refused
> > > > > >         at java.net.PlainSocketImpl.socketConnect(Native Method)
> > > > > >         at
> > > java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
> > > > > >         at
> > > > > >
> java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
> > > > > >         at
> > java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
> > > > > >         at
> > java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> > > > > >         at java.net.Socket.connect(Socket.java:529)
> > > > > >         at
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
> > > > > >         at
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
> > > > > >         at
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
> > > > > >         at
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
> > > > > >         at java.lang.Thread.run(Thread.java:662)
> > > > > > 2014-02-11 15:42:18,268 [myid:1] - WARN
> > > > > >  [WorkerSender[myid=1]:QuorumCnxManager@368] - Cannot open
> channel
> > > to
> > > > 3
> > > > > at
> > > > > > election address imon-3/172.16.38.145:3888
> > > > > > java.net.ConnectException: Connection refused
> > > > > >         at java.net.PlainSocketImpl.socketConnect(Native Method)
> > > > > >         at
> > > java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
> > > > > >         at
> > > > > >
> java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
> > > > > >         at
> > java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
> > > > > >         at
> > java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> > > > > >         at java.net.Socket.connect(Socket.java:529)
> > > > > >         at
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
> > > > > >         at
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
> > > > > >         at
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
> > > > > >         at
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
> > > > > >         at java.lang.Thread.run(Thread.java:662)
> > > > > > 2014-02-11 15:42:18,436 [myid:1] - WARN  [QuorumPeer[myid=1]/
> > > > > 0.0.0.0:2181
> > > > > > :QuorumCnxManager@368] - Cannot open channel to 2 at election
> > > address
> > > > > > imon-2/172.16.38.144:3888
> > > > > > java.net.ConnectException: Connection refused
> > > > > >
> > > > > > ... ...
> > > > > >
> > > > > >
> > > > > > Do these warnings matter?  What is the reason.
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message