zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Oleksandr Sokalskyi1 <Oleksandr_Sokalsk...@epam.com>
Subject Re: Zookeeper cluster suddenly crashes. Zk version 3.4.6
Date Fri, 15 May 2015 13:54:31 GMT
Hi Patrick,
Thanks for your reply.

We use same configuration for all zookeepers:

tickTime=2000
initLimit=5
syncLimit=2
dataDir=/app/lib/zookeeper
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=1

server.1=zk1.company.com:2888:3888
server.2=zk2.company.com:2888:3888
server.3=zk3.company.com:2888:3888


Thank you.
________________________________________
From: Patrick Hunt <phunt@apache.org>
Sent: Wednesday, May 13, 2015 12:56 AM
To: UserZooKeeper
Subject: Re: Zookeeper cluster suddenly crashes. Zk version 3.4.6

Hi Oleksandr, I think this is the issue, from the leader:

"Closing connection to peer due to transaction timeout."

Can you send your configuration from the servers? It seems that you're
exceeding the sync limit (see LearnerHandler@687).

Perhaps you have some network or other activity at that time of night
that's causing the servers to fail to communicate in a timely fashion?
(within the sync limit)

Patrick


On Mon, May 11, 2015 at 10:36 AM, Oleksandr Sokalskyi1 <
Oleksandr_Sokalskyi1@epam.com> wrote:

> Hi All,
>
>
>
> We are using Apache Storm with Apache Zookeepers.
>
> In our configuration we have Zookeeper cluster with 3 nodes in it.
> Everything was well. But recently we faced issue that zookeepers got
> restarted during night, although there are no calculations or any load
> during at that period. And it continuously happening on each night.
>
> Zookeeper version is 3.4.6.
>
>
>
> Here are the logs for time of failure.
>
>
>
> On follower:
>
> 2015-05-02 08:20:52,603 [myid:3] - INFO
> [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
>
> 2015-05-02 08:20:53,044 [myid:3] - INFO
> [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
>
> 2015-05-02 08:45:03,416 [myid:3] - WARN  [QuorumPeer[myid=3]/0.0.0.0:2181
> :Follower@89] - Exception when following the leader
>
> java.io.EOFException
>
>                 at
> java.io.DataInputStream.readInt(DataInputStream.java:392)
>
>                 at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>
>                 at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>
>                 at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>
>                 at
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>
>                 at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>
>                 at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
>
> 2015-05-02 08:45:03,417 [myid:3] - INFO  [QuorumPeer[myid=3]/0.0.0.0:2181
> :Follower@166] - shutdown called
>
> java.lang.Exception: shutdown Follower
>
>                 at
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>
>                 at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
>
> 2015-05-02 08:45:03,418 [myid:3] - INFO  [QuorumPeer[myid=3]/0.0.0.0:2181
> :NIOServerCnxn@1007] - Closed socket connection for client /
> 10.49.5.15:41692 which had sessionid 0x34cef2f0370017c
>
> 2015-05-02 08:45:03,418 [myid:3] - INFO  [QuorumPeer[myid=3]/0.0.0.0:2181
> :NIOServerCnxn@1007] - Closed socket connection for client /
> 10.49.5.15:41687 which had sessionid 0x34cef2f0370017a
>
> 2015-05-02 08:45:03,419 [myid:3] - INFO  [QuorumPeer[myid=3]/0.0.0.0:2181
> :NIOServerCnxn@1007] - Closed socket connection for client /
> 10.49.5.19:40478 which had sessionid 0x34cef2f0370017d
>
> 2015-05-02 08:45:03,419 [myid:3] - INFO  [QuorumPeer[myid=3]/0.0.0.0:2181
> :NIOServerCnxn@1007] - Closed socket connection for client /
> 10.49.5.19:40474 which had sessionid 0x34cef2f0370017b
>
> 2015-05-02 08:45:03,420 [myid:3] - INFO  [QuorumPeer[myid=3]/0.0.0.0:2181
> :NIOServerCnxn@1007] - Closed socket connection for client /
> 10.49.5.17:55431 which had sessionid 0x34cef2f03700179
>
> 2015-05-02 08:45:03,420 [myid:3] - INFO  [QuorumPeer[myid=3]/0.0.0.0:2181
> :FollowerZooKeeperServer@139] - Shutting down
>
> 2015-05-02 08:45:03,420 [myid:3] - INFO  [QuorumPeer[myid=3]/0.0.0.0:2181
> :ZooKeeperServer@441] - shutting down
>
> 2015-05-02 08:45:03,420 [myid:3] - INFO  [QuorumPeer[myid=3]/0.0.0.0:2181
> :FollowerRequestProcessor@105] - Shutting down
>
> 2015-05-02 08:45:03,421 [myid:3] - INFO
> [FollowerRequestProcessor:3:FollowerRequestProcessor@95] -
> FollowerRequestProcessor exited loop!
>
> 2015-05-02 08:45:03,421 [myid:3] - INFO  [QuorumPeer[myid=3]/0.0.0.0:2181
> :CommitProcessor@181] - Shutting down
>
> 2015-05-02 08:45:03,422 [myid:3] - INFO  [QuorumPeer[myid=3]/0.0.0.0:2181
> :FinalRequestProcessor@415] - shutdown of request processor complete
>
> 2015-05-02 08:45:03,422 [myid:3] - INFO
> [CommitProcessor:3:CommitProcessor@150] - CommitProcessor exited loop!
>
> 2015-05-02 08:45:03,422 [myid:3] - INFO  [QuorumPeer[myid=3]/0.0.0.0:2181
> :SyncRequestProcessor@209] - Shutting down
>
>
>
>
>
> On leader:
>
>
>
> 2015-05-02 08:20:45,078 [myid:2] - INFO
> [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
>
> 2015-05-02 08:20:45,403 [myid:2] - INFO
> [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
>
> 2015-05-02 08:45:03,415 [myid:2] - WARN  [QuorumPeer[myid=2]/0.0.0.0:2181
> :LearnerHandler@687] - Closing connection to peer due to transaction
> timeout.
>
> 2015-05-02 08:45:03,415 [myid:2] - WARN  [QuorumPeer[myid=2]/0.0.0.0:2181
> :LearnerHandler@687] - Closing connection to peer due to transaction
> timeout.
>
> 2015-05-02 08:45:03,416 [myid:2] - WARN
> [LearnerHandler-/148.112.48.199:43202:LearnerHandler@646] - *******
> GOODBYE /148.112.48.199:43202 ********
>
> 2015-05-02 08:45:03,417 [myid:2] - WARN
> [LearnerHandler-/148.112.48.199:43202:LearnerHandler@658] - Ignoring
> unexpected exception
>
> java.lang.InterruptedException
>
>                 at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>
>                 at
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>
>                 at
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>
>                 at
> org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
>
>                 at
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
>
> 2015-05-02 08:45:03,417 [myid:2] - WARN
> [LearnerHandler-/148.112.48.201:33870:LearnerHandler@646] - *******
> GOODBYE /148.112.48.201:33870 ********
>
> 2015-05-02 08:45:03,417 [myid:2] - WARN
> [LearnerHandler-/148.112.48.201:33870:LearnerHandler@658] - Ignoring
> unexpected exception
>
> java.lang.InterruptedException
>
>                 at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>
>                 at
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>
>                 at
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>
>                 at
> org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
>
>                 at
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
>
> 2015-05-02 08:45:03,536 [myid:2] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
> connection from /10.49.5.16:53812
>
> 2015-05-02 08:45:03,539 [myid:2] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew
> session 0x14cef2ee4c80171 at /10.49.5.16:53812
>
> 2015-05-02 08:45:03,539 [myid:2] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617] - Established session
> 0x14cef2ee4c80171 with negotiated timeout 20000 for client /
> 10.49.5.16:53812
>
> 2015-05-02 08:45:03,582 [myid:2] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
> connection from /10.49.5.19:37378
>
> 2015-05-02 08:45:03,585 [myid:2] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew
> session 0x14cef2ee4c80173 at /10.49.5.19:37378
>
> 2015-05-02 08:45:03,586 [myid:2] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617] - Established session
> 0x14cef2ee4c80173 with negotiated timeout 20000 for client /
> 10.49.5.19:37378
>
> 2015-05-02 08:45:03,606 [myid:2] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
> connection from /10.49.5.17:40379
>
> ............
>
> 2015-05-02 08:45:04,379 [myid:2] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617] - Established session
> 0x14cef2ee4c8016f with negotiated timeout 20000 for client /
> 10.49.5.18:52703
>
> 2015-05-02 08:45:04,416 [myid:2] - INFO  [QuorumPeer[myid=2]/0.0.0.0:2181
> :Leader@493] - Shutting down
>
> 2015-05-02 08:45:04,416 [myid:2] - INFO  [QuorumPeer[myid=2]/0.0.0.0:2181
> :Leader@499] - Shutdown called
>
> java.lang.Exception: shutdown Leader! reason: Not sufficient followers
> synced, only synced with sids: [ 2 ]
>
>                 at
> org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499)
>
>                 at
> org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:474)
>
>                 at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:799)
>
> 2015-05-02 08:45:04,417 [myid:2] - INFO  [QuorumPeer[myid=2]/0.0.0.0:2181
> :NIOServerCnxn@1007] - Closed socket connection for client /
> 10.49.5.17:53780 which had sessionid 0x24cef2ee4cf016a
>
>
>
>
>
> I don't understand why it's happening during nights, when there are no
> load. And everything running ok during day.
>
> Can you help me to understand the root cause.
>
> Will increasing syncLimit parameter? resolve this issue?
>
>
>
> Thanks.
>
>

Mime
View raw message