zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Scott Lindner <scott.a.lind...@gmail.com>
Subject Zookeeper cluster failing
Date Tue, 06 Mar 2012 16:22:56 GMT
Hi,

We are running 3.4.2 and have a development cluster of 3 servers running on
Windows and have run into some strange issues.  It seems to be related to
one of servers falling out of sync with the other 2 in the cluster.  I'll
describe the problem in more detail:

Everything was working fine yesterday and then all of a sudden late in the
morning the cluster seemed to get into a bad state and no clients could
connect to the cluster.  No changes have been made to the cluster for over
a week and things had been working smoothly up until now.  Looking at the
servers it appeared they were all able to communicate with each other,
perform leader election, but then shortly after we would get errors while
waiting for messages.  (Relevant snippets from the logs at various points
later in the day included below)

At this time no clients could connect and we would see "zookeeper sever not
running" messages as well.  (see logs)

If we took down 2 of the 3 servers in the cluster all clients could connect
and everything would function fine, but as soon as we introduce an
additional server to the cluster client connections would no longer work
and the errors described above would repeat.  Also, no configuration that
involved the out of sync server would allow the servers to properly
communicate to one another, but if we only involved the 2 servers that were
in sync then communication between the servers seemed to work without error.

We backed up the data from the out of sync server, deleted its data, and
brought back all 3 servers.  This seemed to solve the problem of the
servers having communication errors and the new / empty server slowly
synced its data. However even now no clients could connect to the cluster
if there was more than 1 server involved.

We then took down the entire cluster and moved to different ports (all 3
ports) and this allowed us to connect using the command line client.  The
only difference now is that we have about 5 or 6 application instances that
are not trying to talk to the cluster on the new ports - so the "traffic"
is less than the original configuration.  We did not try to update our
applications to point to the new ports as we really want to understand what
the problem is and how to resolve / avoid this in the future.

Any help / ideas you can provide would be greatly appreciated as we are
fairly stumped at this point.

As an aside, we tried various configuration options (tick time, init limit,
sync limit, etc) but they didn't seem to have any effect in terms of how
long it would wait prior to interrupting. It also doesn't seem that our
log4j configuration is being respected, but these are more minor to the
larger problem described above.

Thanks,
-Scott


2012-03-05 16:06:01,828 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection
f
rom /10.17.30.160:60736
2012-03-05 16:06:01,828 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection
f
rom /10.17.20.145:56169
2012-03-05 16:06:01,828 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session
0x0 due to java.io.IOException: ZooKeeperServer not running
2012-03-05 16:06:01,828 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /10.17.30.160:60736 (no session established for client)
2012-03-05 16:06:01,828 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session
0x0 due to java.io.IOException: ZooKeeperServer not running
2012-03-05 16:06:01,828 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /10.17.20.145:56169 (no session established for client)
2012-03-05 16:06:01,812 [myid:3] - INFO  [
zookeeper-windows3.vivisimo.com/10.17.20.164:13888:QuorumCnxManager$Listener@493]
- Received conne
ction request /10.17.20.142:35819
2012-03-05 16:06:01,828 [myid:3] - WARN
 [RecvWorker:1:QuorumCnxManager$RecvWorker@762] - Connection broken for id
1, my id = 3, error =
java.net.SocketException: socket closed
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.io.DataInputStream.readInt(Unknown Source)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)
2012-03-05 16:06:01,828 [myid:3] - WARN
 [RecvWorker:1:QuorumCnxManager$RecvWorker@765] - Interrupting SendWorker
2012-03-05 16:06:01,859 [myid:3] - WARN
 [SendWorker:1:QuorumCnxManager$SendWorker@679] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown
Source)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown
Source)
        at java.util.concurrent.ArrayBlockingQueue.poll(Unknown Source)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
2012-03-05 16:06:01,859 [myid:3] - WARN
 [SendWorker:1:QuorumCnxManager$SendWorker@688] - Send worker leaving thread
2012-03-05 16:06:01,859 [myid:3] - INFO
 [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 3
(n.leader), 377957122068 (n.zxid)
, 1 (n.round), FOLLOWING (n.state), 1 (n.sid), 88 (n.peerEPoch), LEADING
(my state)
2012-03-05 16:06:02,203 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection
f
rom /192.168.0.238:47529
2012-03-05 16:06:02,203 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session
0x0 due to java.io.IOException: ZooKeeperServer not running
2012-03-05 16:06:02,203 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /192.168.0.238:47529 (no session established for client)



==================================





2012-03-05 16:07:06,468 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /10.17.30.100:61367 which had sessionid 0x335e4af5c7b0007
2012-03-05 16:07:06,468 [myid:3] - INFO
 [CommitProcessor:3:ZooKeeperServer@596] - Established session
0x335e4af5c7b0008 with negotiated tim
eout 15000 for client /192.168.0.238:47643
2012-03-05 16:07:06,468 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x335e4af5c7b0008, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
        at java.lang.Thread.run(Unknown Source)
2012-03-05 16:07:06,468 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /192.168.0.238:47643 which had sessionid 0x335e4af5c7b0008
2012-03-05 16:07:06,468 [myid:3] - INFO
 [CommitProcessor:3:ZooKeeperServer@596] - Established session
0x335e4af5c7b0009 with negotiated tim
eout 15000 for client /10.17.30.100:61372
2012-03-05 16:07:06,468 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x335e4af5c7b0009, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
        at java.lang.Thread.run(Unknown Source)






====================================




2012-03-05 16:05:50,234 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection
f
rom /10.17.20.145:56160
2012-03-05 16:05:50,234 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session
0x0 due to java.io.IOException: ZooKeeperServer not running
2012-03-05 16:05:50,234 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /10.17.20.145:56160 (no session established for client)
2012-03-05 16:05:50,421 [myid:3] - WARN
 [SendWorker:1:QuorumCnxManager$SendWorker@679] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown
Source)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown
Source)
        at java.util.concurrent.ArrayBlockingQueue.poll(Unknown Source)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
2012-03-05 16:05:50,437 [myid:3] - WARN
 [SendWorker:1:QuorumCnxManager$SendWorker@688] - Send worker leaving thread
2012-03-05 16:05:50,437 [myid:3] - INFO
 [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 3
(n.leader), 377957122068 (n.zxid)
, 1 (n.round), FOLLOWING (n.state), 1 (n.sid), 88 (n.peerEPoch), LEADING
(my state)
2012-03-05 16:05:50,640 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection
f
rom /10.17.30.160:60716
2012-03-05 16:05:50,640 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session
0x0 due to java.io.IOException: ZooKeeperServer not running
2012-03-05 16:05:50,640 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /10.17.30.160:60716 (no session established for client)
2012-03-05 16:05:50,703 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection
f
rom /10.17.30.100:61281
2012-03-05 16:05:50,703 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session
0x0 due to java.io.IOException: ZooKeeperServer not running
2012-03-05 16:05:50,703 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /10.17.30.100:61281 (no session established for client)

=======

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message