zookeeper-user mailing list archives

From Flavio Junqueira <...@apache.org>
Subject Re: Unstable work of zookeeper
Date Thu, 24 Sep 2015 22:16:02 GMT
The election connections don't have to stay established for the cluster to operate. Am I missing anything?

-Flavio

> On 24 Sep 2015, at 11:38, Raúl Gutiérrez Segalés <rgs@itevenworks.net> wrote:
> 
> On 24 September 2015 at 06:36, Flavio Junqueira <fpj@apache.org> wrote:
> 
>> I can see that the client is disconnecting from the server, and there is
>> also a new round of leader election for the zookeeper servers. If this is
>> happening, then yeah, your ensemble is unstable. If the ensemble leader
>> election is being triggered frequently, then I'd start by looking there.
>> Try to determine why the ensemble is failing to continue with the same
>> leader. If ensemble elections aren't happening frequently, then another
>> possibility is that GC pauses are causing the session to expire.
>> 
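To make the GC-pause hypothesis above concrete, here is a minimal sketch that checks measured pause durations against the session timeout. The function name, the `fraction` parameter, and the pause values in the usage note are all hypothetical; only the 20001 ms negotiated timeout comes from the server log quoted further down. As far as I know the Java client gives up on a silent server after roughly two thirds of the session timeout, so passing `fraction=2/3` is a stricter check, but treat that figure as an assumption to verify for your client version.

```python
def risky_pauses(pauses_ms, session_timeout_ms, fraction=1.0):
    """Return the pauses long enough to threaten a ZooKeeper session.

    pauses_ms          -- pause durations in milliseconds (e.g. taken from GC logs)
    session_timeout_ms -- the negotiated session timeout reported by the server
    fraction           -- share of the timeout treated as the danger threshold
                          (hypothetical knob; 1.0 means the full timeout)
    """
    limit = session_timeout_ms * fraction
    return [p for p in pauses_ms if p >= limit]
```

For example, `risky_pauses([150, 21000, 14000], 20001)` flags only the 21000 ms pause, while adding `fraction=2/3` also flags the 14000 ms one.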
> 
> On the other hand, if it's a low-traffic cluster you might need to enable
> TCP keepalives to ensure election connections between the cluster members
> don't go away (the ZAB connections, on the other hand, have protocol-level
> pings, IIRC, so those should be fine, I think):
> 
> https://issues.apache.org/jira/browse/ZOOKEEPER-1748
> 
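For readers unfamiliar with TCP keepalives, the sketch below shows what enabling them means at the socket level. It is a generic Python illustration, not ZooKeeper's code: ZooKeeper is Java, and to my knowledge the change in the JIRA above is switched on with the `zookeeper.tcpKeepAlive` system property, which should be verified against the admin guide for the version in use. The function name and the default timing values are assumptions chosen for illustration.

```python
import socket


def open_keepalive_socket(idle_s=60, interval_s=10, probes=5):
    """Create an unconnected TCP socket with keepalive probing enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_KEEPALIVE asks the kernel to probe a connection that has gone idle,
    # so a silently dropped peer (or a firewall that forgot the flow) is noticed.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket timing knobs are platform specific; set only those that exist.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection is dropped
    )
    for name, value in tuning:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

Without the per-socket tuning, the kernel defaults apply; on Linux the default idle time before the first probe is typically two hours, which may be longer than an intermediate firewall keeps idle connections open.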
> 
> -rgs
> 
> 
> 
>> -Flavio
>> 
>>> On 24 Sep 2015, at 05:24, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>> 
>>> Hi,
>>> I am using ZooKeeper 3.4.6.
>>> I have a Spark cluster configured with HA. Once every 1-2 days, the active Spark master shuts down with these messages:
>>> 15/09/23 18:58:18 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x34ffa68dbd10021, likely server has closed socket, closing socket connection and attempting reconnect
>>> 15/09/23 18:58:18 INFO state.ConnectionStateManager: State change: SUSPENDED
>>> 15/09/23 18:58:18 INFO master.ZooKeeperLeaderElectionAgent: We have lost leadership
>>> 15/09/23 18:58:18 ERROR master.Master: Leadership has been revoked -- master shutting down.
>>> 15/09/23 18:58:18 INFO util.Utils: Shutdown hook called
>>> 
>>> I don’t have the ZooKeeper logs from the same period, but the logs are full of these messages:
>>> 2015-09-24 05:07:42,228 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.0.8.4:34705
>>> 2015-09-24 05:07:42,229 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection request from old client /10.0.8.4:34705; will be dropped if server is in r-o mode
>>> 2015-09-24 05:07:42,229 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /10.0.8.4:34705
>>> 2015-09-24 05:07:42,292 [myid:1] - INFO  [CommitProcessor:1:ZooKeeperServer@617] - Established session 0x14ffd3670130030 with negotiated timeout 20001 for client /10.0.8.4:34705
>>> 2015-09-24 05:07:42,302 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
>>> EndOfStreamException: Unable to read additional data from client sessionid 0x14ffd3670130030, likely client has closed socket
>>>      at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>>>      at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>>>      at java.lang.Thread.run(Thread.java:745)
>>> 2015-09-24 05:07:42,303 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.0.8.4:34705 which had sessionid 0x14ffd3670130030
>>> 2015-09-24 05:07:42,314 [myid:1] - ERROR [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception:
>>> java.nio.channels.CancelledKeyException
>>>      at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>>      at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>>      at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>>>      at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>>>      at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>>>      at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>>> 2015-09-24 05:07:42,334 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.0.8.4:34707
>>> 2015-09-24 05:07:42,334 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection request from old client /10.0.8.4:34707; will be dropped if server is in r-o mode
>>> 2015-09-24 05:07:42,335 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /10.0.8.4:34707
>>> 2015-09-24 05:07:42,357 [myid:1] - INFO  [CommitProcessor:1:ZooKeeperServer@617] - Established session 0x14ffd3670130031 with negotiated timeout 20001 for client /10.0.8.4:34707
>>> 2015-09-24 05:07:42,364 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
>>> EndOfStreamException: Unable to read additional data from client sessionid 0x14ffd3670130031, likely client has closed socket
>>>      at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>>>      at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>>>      at java.lang.Thread.run(Thread.java:745)
>>> 2015-09-24 05:07:42,365 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.0.8.4:34707 which had sessionid 0x14ffd3670130031
>>> 2015-09-24 05:07:42,376 [myid:1] - ERROR [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception:
>>> java.nio.channels.CancelledKeyException
>>>      at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>>      at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>>      at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>>>      at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>>>      at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>>>      at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>>> 
>>> Also there are
>>> 2015-09-24 06:29:54,459 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@139] - Shutting down
>>> 2015-09-24 06:29:54,459 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@441] - shutting down
>>> 2015-09-24 06:29:54,459 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@105] - Shutting down
>>> 2015-09-24 06:29:54,459 [myid:1] - INFO  [FollowerRequestProcessor:1:FollowerRequestProcessor@95] - FollowerRequestProcessor exited loop!
>>> 2015-09-24 06:29:54,460 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@181] - Shutting down
>>> 2015-09-24 06:29:54,464 [myid:1] - INFO  [CommitProcessor:1:CommitProcessor@150] - CommitProcessor exited loop!
>>> 2015-09-24 06:29:54,465 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415] - shutdown of request processor complete
>>> 2015-09-24 06:29:54,466 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209] - Shutting down
>>> 2015-09-24 06:29:54,466 [myid:1] - INFO  [SyncThread:1:SyncRequestProcessor@187] - SyncRequestProcessor exited!
>>> 2015-09-24 06:29:54,466 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@714] - LOOKING
>>> 2015-09-24 06:29:54,584 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.0.8.58:36137
>>> 2015-09-24 06:29:54,584 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>>> 2015-09-24 06:29:54,584 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.0.8.58:36137 (no session established for client)
>>> 2015-09-24 06:29:54,679 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.0.8.57:57410
>>> 2015-09-24 06:29:54,680 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>>> 2015-09-24 06:29:54,680 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.0.8.57:57410 (no session established for client)
>>> 
>>> I also observed that hadoop-zkfc restarts very frequently.
>>> Any ideas what could be wrong?
>>> 
>>> Thanks.
>> 
>> 
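One way to judge whether leader election is being triggered frequently is to count how often a server logs the LOOKING state. The sketch below assumes the log line layout quoted above (a timestamp followed by a message ending in "- LOOKING", each entry on a single line); the function names are hypothetical, and other log4j layouts would need a different pattern.

```python
import re
from datetime import datetime

# Matches entries such as:
# 2015-09-24 06:29:54,466 [myid:1] - INFO  [QuorumPeer[myid=1]/...:QuorumPeer@714] - LOOKING
LOOKING_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} .* - LOOKING\s*$"
)


def election_times(lines):
    """Return the timestamps at which the server entered the LOOKING state."""
    times = []
    for line in lines:
        match = LOOKING_RE.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def gaps_seconds(times):
    """Return the seconds elapsed between consecutive LOOKING events."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
```

Passing an open log file to `election_times` and the result to `gaps_seconds` gives the spacing between elections on that server; many short gaps would point to the unstable-ensemble case rather than to client-side GC pauses.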

