zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick Hunt <ph...@apache.org>
Subject Re: Zookeeper cluster suddenly crashes. Zk version 3.4.6
Date Mon, 18 May 2015 16:47:07 GMT
Your limits seem low, I usually run with sync limit of 5 and init limit of
10. Perhaps give that a try? Right now your sync limit is just 4 seconds.
Regardless you'd need to see what's happening on the network and the hosts
during the period when you lose connectivity if you really want to track
this down.

Patrick

On Fri, May 15, 2015 at 6:54 AM, Oleksandr Sokalskyi1 <
Oleksandr_Sokalskyi1@epam.com> wrote:

> Hi Patrick,
> Thanks for your reply.
>
> We use same configuration for all zookeepers:
>
> tickTime=2000
> initLimit=5
> syncLimit=2
> dataDir=/app/lib/zookeeper
> clientPort=2181
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
>
> server.1=zk1.company.com:2888:3888
> server.2=zk2.company.com:2888:3888
> server.3=zk3.company.com:2888:3888
>
>
> Thank you.
> ________________________________________
> From: Patrick Hunt <phunt@apache.org>
> Sent: Wednesday, May 13, 2015 12:56 AM
> To: UserZooKeeper
> Subject: Re: Zookeeper cluster suddenly crashes. Zk version 3.4.6
>
> Hi Oleksandr, I think this is the issue, from the leader:
>
> "Closing connection to peer due to transaction timeout."
>
> Can you send your configuration from the servers? It seems that you're
> exceeding the sync limit (see LearnerHandler@687).
>
> Perhaps you have some network or other activity at that time of night
> that's causing the servers to fail to communicate in a timely fashion?
> (within the sync limit)
>
> Patrick
>
>
> On Mon, May 11, 2015 at 10:36 AM, Oleksandr Sokalskyi1 <
> Oleksandr_Sokalskyi1@epam.com> wrote:
>
> > Hi All,
> >
> >
> >
> > We are using Apache Storm with Apache Zookeepers.
> >
> > In our configuration we have Zookeeper cluster with 3 nodes in it.
> > Everything was well. But recently we faced issue that zookeepers got
> > restarted during night, although there are no calculations or any load
> > during at that period. And it continuously happening on each night.
> >
> > Zookeeper version is 3.4.6.
> >
> >
> >
> > Here are the logs for time of failure.
> >
> >
> >
> > On follower:
> >
> > 2015-05-02 08:20:52,603 [myid:3] - INFO
> > [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
> >
> > 2015-05-02 08:20:53,044 [myid:3] - INFO
> > [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
> >
> > 2015-05-02 08:45:03,416 [myid:3] - WARN  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :Follower@89] - Exception when following the leader
> >
> > java.io.EOFException
> >
> >                 at
> > java.io.DataInputStream.readInt(DataInputStream.java:392)
> >
> >                 at
> > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >
> >                 at
> >
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >
> >                 at
> >
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
> >
> >                 at
> > org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
> >
> >                 at
> >
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> >
> >                 at
> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> >
> > 2015-05-02 08:45:03,417 [myid:3] - INFO  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :Follower@166] - shutdown called
> >
> > java.lang.Exception: shutdown Follower
> >
> >                 at
> > org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> >
> >                 at
> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
> >
> > 2015-05-02 08:45:03,418 [myid:3] - INFO  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :NIOServerCnxn@1007] - Closed socket connection for client /
> > 10.49.5.15:41692 which had sessionid 0x34cef2f0370017c
> >
> > 2015-05-02 08:45:03,418 [myid:3] - INFO  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :NIOServerCnxn@1007] - Closed socket connection for client /
> > 10.49.5.15:41687 which had sessionid 0x34cef2f0370017a
> >
> > 2015-05-02 08:45:03,419 [myid:3] - INFO  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :NIOServerCnxn@1007] - Closed socket connection for client /
> > 10.49.5.19:40478 which had sessionid 0x34cef2f0370017d
> >
> > 2015-05-02 08:45:03,419 [myid:3] - INFO  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :NIOServerCnxn@1007] - Closed socket connection for client /
> > 10.49.5.19:40474 which had sessionid 0x34cef2f0370017b
> >
> > 2015-05-02 08:45:03,420 [myid:3] - INFO  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :NIOServerCnxn@1007] - Closed socket connection for client /
> > 10.49.5.17:55431 which had sessionid 0x34cef2f03700179
> >
> > 2015-05-02 08:45:03,420 [myid:3] - INFO  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :FollowerZooKeeperServer@139] - Shutting down
> >
> > 2015-05-02 08:45:03,420 [myid:3] - INFO  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :ZooKeeperServer@441] - shutting down
> >
> > 2015-05-02 08:45:03,420 [myid:3] - INFO  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :FollowerRequestProcessor@105] - Shutting down
> >
> > 2015-05-02 08:45:03,421 [myid:3] - INFO
> > [FollowerRequestProcessor:3:FollowerRequestProcessor@95] -
> > FollowerRequestProcessor exited loop!
> >
> > 2015-05-02 08:45:03,421 [myid:3] - INFO  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :CommitProcessor@181] - Shutting down
> >
> > 2015-05-02 08:45:03,422 [myid:3] - INFO  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :FinalRequestProcessor@415] - shutdown of request processor complete
> >
> > 2015-05-02 08:45:03,422 [myid:3] - INFO
> > [CommitProcessor:3:CommitProcessor@150] - CommitProcessor exited loop!
> >
> > 2015-05-02 08:45:03,422 [myid:3] - INFO  [QuorumPeer[myid=3]/
> 0.0.0.0:2181
> > :SyncRequestProcessor@209] - Shutting down
> >
> >
> >
> >
> >
> > On leader:
> >
> >
> >
> > 2015-05-02 08:20:45,078 [myid:2] - INFO
> > [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
> >
> > 2015-05-02 08:20:45,403 [myid:2] - INFO
> > [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
> >
> > 2015-05-02 08:45:03,415 [myid:2] - WARN  [QuorumPeer[myid=2]/
> 0.0.0.0:2181
> > :LearnerHandler@687] - Closing connection to peer due to transaction
> > timeout.
> >
> > 2015-05-02 08:45:03,415 [myid:2] - WARN  [QuorumPeer[myid=2]/
> 0.0.0.0:2181
> > :LearnerHandler@687] - Closing connection to peer due to transaction
> > timeout.
> >
> > 2015-05-02 08:45:03,416 [myid:2] - WARN
> > [LearnerHandler-/148.112.48.199:43202:LearnerHandler@646] - *******
> > GOODBYE /148.112.48.199:43202 ********
> >
> > 2015-05-02 08:45:03,417 [myid:2] - WARN
> > [LearnerHandler-/148.112.48.199:43202:LearnerHandler@658] - Ignoring
> > unexpected exception
> >
> > java.lang.InterruptedException
> >
> >                 at
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
> >
> >                 at
> >
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> >
> >                 at
> >
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
> >
> >                 at
> >
> org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
> >
> >                 at
> >
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
> >
> > 2015-05-02 08:45:03,417 [myid:2] - WARN
> > [LearnerHandler-/148.112.48.201:33870:LearnerHandler@646] - *******
> > GOODBYE /148.112.48.201:33870 ********
> >
> > 2015-05-02 08:45:03,417 [myid:2] - WARN
> > [LearnerHandler-/148.112.48.201:33870:LearnerHandler@658] - Ignoring
> > unexpected exception
> >
> > java.lang.InterruptedException
> >
> >                 at
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
> >
> >                 at
> >
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> >
> >                 at
> >
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
> >
> >                 at
> >
> org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
> >
> >                 at
> >
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
> >
> > 2015-05-02 08:45:03,536 [myid:2] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
> > connection from /10.49.5.16:53812
> >
> > 2015-05-02 08:45:03,539 [myid:2] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew
> > session 0x14cef2ee4c80171 at /10.49.5.16:53812
> >
> > 2015-05-02 08:45:03,539 [myid:2] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617] - Established session
> > 0x14cef2ee4c80171 with negotiated timeout 20000 for client /
> > 10.49.5.16:53812
> >
> > 2015-05-02 08:45:03,582 [myid:2] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
> > connection from /10.49.5.19:37378
> >
> > 2015-05-02 08:45:03,585 [myid:2] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew
> > session 0x14cef2ee4c80173 at /10.49.5.19:37378
> >
> > 2015-05-02 08:45:03,586 [myid:2] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617] - Established session
> > 0x14cef2ee4c80173 with negotiated timeout 20000 for client /
> > 10.49.5.19:37378
> >
> > 2015-05-02 08:45:03,606 [myid:2] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
> > connection from /10.49.5.17:40379
> >
> > ............
> >
> > 2015-05-02 08:45:04,379 [myid:2] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617] - Established session
> > 0x14cef2ee4c8016f with negotiated timeout 20000 for client /
> > 10.49.5.18:52703
> >
> > 2015-05-02 08:45:04,416 [myid:2] - INFO  [QuorumPeer[myid=2]/
> 0.0.0.0:2181
> > :Leader@493] - Shutting down
> >
> > 2015-05-02 08:45:04,416 [myid:2] - INFO  [QuorumPeer[myid=2]/
> 0.0.0.0:2181
> > :Leader@499] - Shutdown called
> >
> > java.lang.Exception: shutdown Leader! reason: Not sufficient followers
> > synced, only synced with sids: [ 2 ]
> >
> >                 at
> > org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499)
> >
> >                 at
> > org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:474)
> >
> >                 at
> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:799)
> >
> > 2015-05-02 08:45:04,417 [myid:2] - INFO  [QuorumPeer[myid=2]/
> 0.0.0.0:2181
> > :NIOServerCnxn@1007] - Closed socket connection for client /
> > 10.49.5.17:53780 which had sessionid 0x24cef2ee4cf016a
> >
> >
> >
> >
> >
> > I don't understand why it's happening during nights, when there are no
> > load. And everything running ok during day.
> >
> > Can you help me to understand the root cause.
> >
> > Will increasing syncLimit parameter? resolve this issue?
> >
> >
> >
> > Thanks.
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message