zookeeper-user mailing list archives

From Bob Sheehan <bshee...@vmware.com>
Subject Re: Observers Unusable
Date Wed, 14 Oct 2015 22:20:38 GMT
To be more specific: after leader election, the old observer connections to
the previous leader are stale (the send-q accumulates), but the TCP timeout
after 15 minutes terminates these old connections and allows a fresh TCP
connection to the new leader.
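
For illustration only (the host and port below are placeholders, not from this
setup): a bounded read timeout makes a dead peer show up in seconds instead of
waiting for the kernel to give up on retransmissions after many minutes.

import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Sketch only: bound how long a read can hang on a stale connection,
// rather than relying on the OS-level TCP retransmission timeout.
public class StaleConnectionProbe {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket();
        // placeholder address for the old leader's election port
        s.connect(new InetSocketAddress("old-leader.example.com", 3888), 5000);
        s.setSoTimeout(5000);  // reads blocking longer than 5s throw SocketTimeoutException
        try {
            s.getInputStream().read();  // would otherwise hang until the kernel tears the socket down
        } catch (SocketTimeoutException e) {
            s.close();  // treat the connection as dead and reconnect to the new leader
        }
    }
}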

Bob

On 10/14/15, 3:17 PM, "Bob Sheehan" <bsheehan@vmware.com> wrote:

>We face a similar issue with remote observers connecting to a 3-node
>cluster. Using netstat we see send-q TCP traffic accumulating. We
>believe this is a TCP timeout issue.
>
>Bob
>
>
>On 10/14/15, 2:54 PM, "elastic search" <elastic.l.k@gmail.com> wrote:
>
>>When it outputs the message "There is a connection already for server",
>>is that really a valid connection, or is it just using the existing map
>>in memory to figure out whether the connection already exists?
>>If it is the latter, it should empty this map and initiate a connection
>>when it is LOOKING for a leader after a connection drop.
>>Right now it waits 15 minutes before it does that and re-initiates a
>>connection.
>>
>> if (senderWorkerMap.get(sid) == null) {
>>     // ... open a new connection to server sid ...
>> } else {
>>     LOG.debug("There is a connection already for server " + sid);
>> }
>>
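>>Something along these lines is what I had in mind (purely an illustrative
>>sketch of the idea; the class and method names are made up, this is not the
>>actual QuorumCnxManager code):
>>
>>import java.net.InetSocketAddress;
>>import java.net.Socket;
>>import java.util.concurrent.ConcurrentHashMap;
>>import java.util.concurrent.ConcurrentMap;
>>
>>// Idea: keep a per-server connection map, but evict the entry as soon as
>>// a send fails, so the next lookup reconnects instead of trusting a stale
>>// socket until the OS TCP timeout finally fires.
>>public class PeerConnections {
>>    private final ConcurrentMap<Long, Socket> senderWorkerMap =
>>            new ConcurrentHashMap<>();
>>
>>    Socket getOrConnect(long sid, InetSocketAddress addr) throws Exception {
>>        Socket existing = senderWorkerMap.get(sid);
>>        // isClosed() only reflects a local close; a silently dead peer
>>        // still looks open, hence the eviction in onSendFailure below.
>>        if (existing != null && !existing.isClosed()) {
>>            return existing;  // "There is a connection already for server " + sid
>>        }
>>        senderWorkerMap.remove(sid);        // drop the stale entry, if any
>>        Socket fresh = new Socket();
>>        fresh.connect(addr, 5000);          // bounded connect timeout
>>        senderWorkerMap.put(sid, fresh);
>>        return fresh;
>>    }
>>
>>    void onSendFailure(long sid) {
>>        Socket stale = senderWorkerMap.remove(sid);  // empty the map entry right away
>>        if (stale != null) {
>>            try { stale.close(); } catch (Exception ignore) {}
>>        }
>>    }
>>}
>>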
>>If ping to the server is not working, how is it able to send
>>notifications? The logs say Sending notifications, but ping is not
>>working at that time.
>>
>>I am not sure how to decrease the 15-minute wait period before the
>>Observer re-initiates a connection, or alternatively how to make this
>>work.
>>
>>
>>On Wed, Oct 14, 2015 at 12:50 PM, Flavio Junqueira <fpj@apache.org>
>>wrote:
>>
>>> The snapshot is really small, so it is unlikely to be it, unless the
>>> connection is really crappy.
>>>
>>> -Flavio
>>>
>>> > On 14 Oct 2015, at 19:38, elastic search <elastic.l.k@gmail.com>
>>> > wrote:
>>> >
>>> > Hi Flavio
>>> >
>>> > Thanks for the reply.
>>> >
>>> > These are the config parameters
>>> > tickTime=2000
>>> > initLimit=10
>>> > syncLimit=5
>>> >
>>> > Data size is
>>> > -rw-r--r-- 1 xoom xoom  85K Oct  3 10:16 snapshot.170001badb
>>> > -rw-r--r-- 1 xoom xoom  65M Oct  3 10:16 log.170000da7f
>>> > -rw-r--r-- 1 xoom xoom  88K Oct 13 23:30 snapshot.170002bfcd
>>> > -rw-r--r-- 1 xoom xoom  65M Oct 13 23:30 log.170001badd
>>> > -rw-r--r-- 1 xoom xoom  65M Oct 14 11:36 log.170002bfcf
>>> >
>>> > These are all defaults and I have never deleted any log files.
>>> > Do you still think this could be due to syncLimit or the snapshot
>>> > being large?
>>> >
>>> >
>>> >
>>> > On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <fpj@apache.org>
>>> > wrote:
>>> >
>>> >>
>>> >>> On 14 Oct 2015, at 17:22, elastic search <elastic.l.k@gmail.com>
>>> >>> wrote:
>>> >>>
>>> >>> The link between AWS and the local DataCenter was down for around
>>> >>> 2 minutes.
>>> >>> I have been running ping continuously from the DataCenter to AWS,
>>> >>> and that wasn't responding for a few minutes.
>>> >>>
>>> >>> Are you saying that, since we see Send Notification messages in the
>>> >>> logs, the Observers are able to connect to ZK?
>>> >>> Only that the ZK server leader is not able to respond back?
>>> >>
>>> >> Yeah, the logs indicate that the observer is able to connect ("There
>>> >> is a connection already for server..." in the logs) but isn't able
>>> >> to make progress, which implies that it isn't receiving responses.
>>> >>
>>> >>>
>>> >>> This is what I see from the server logs:
>>> >>> 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/10.10.4.46:38161:LearnerHandler@633] - Unexpected exception causing shutdown while sock still open
>>> >>> 2015-10-09 16:19:28,160 [myid:3] - WARN [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for id 5, my id = 3, error =
>>> >>>
>>> >>
>>> >> These lines belong to different parts of the code path. The first
>>> >> actually indicates that the observer found the leader, but could not
>>> >> keep going. For this particular time period, it could be that the
>>> >> observer is trying to sync up with the leader, but the value of
>>> >> syncLimit is too low and the leader is abandoning the observer. One
>>> >> reason for that to happen is the snapshot being large (or your
>>> >> available bandwidth being low).
>>> >>
>>> >> The second line is from the part of the code that does connections
>>> >> for leader election. That one also says that the connection broke.
>>> >>
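>>> >> If it is the sync that is timing out, one thing to try is giving the
>>> >> learners more time in zoo.cfg, along these lines (values purely
>>> >> illustrative, tune them for your link):
>>> >>
>>> >> tickTime=2000
>>> >> # initLimit * tickTime = 40s for a learner to connect and sync initially
>>> >> initLimit=20
>>> >> # syncLimit * tickTime = 20s to stay in sync afterwards
>>> >> syncLimit=10
>>> >>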
>>> >> -Flavio
>>> >>
>>> >>
>>> >>> On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fpj@apache.org>
>>> >>> wrote:
>>> >>>
>>> >>>> Can you tell why the server wasn't responding to the notifications
>>> >>>> from the observer? The log file is from the observer and it sounds
>>> >>>> like it is able to send messages out, but it isn't clear why the
>>> >>>> server isn't responding.
>>> >>>>
>>> >>>> -Flavio
>>> >>>>
>>> >>>>> On 14 Oct 2015, at 01:51, elastic search <elastic.l.k@gmail.com>
>>> >>>>> wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>> Hello Experts
>>> >>>>>
>>> >>>>> We have 2 Observers running in AWS connecting to a local ZK
>>> >>>>> Ensemble in our own DataCenter.
>>> >>>>>
>>> >>>>> There have been instances where we see a network drop for a
>>> >>>>> minute between the networks.
>>> >>>>> However, the Observers take around 15 minutes to recover even if
>>> >>>>> the network outage lasts only a minute.
>>> >>>>>
>>> >>>>> From the logs:
>>> >>>>> java.net.SocketTimeoutException: Read timed out
>>> >>>>> 2015-10-13 22:26:03,927 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 400
>>> >>>>> 2015-10-13 22:26:04,328 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 800
>>> >>>>> 2015-10-13 22:26:05,129 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 1600
>>> >>>>> 2015-10-13 22:26:06,730 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 3200
>>> >>>>> 2015-10-13 22:26:09,931 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 6400
>>> >>>>> 2015-10-13 22:26:16,332 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 12800
>>> >>>>> 2015-10-13 22:26:29,133 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 25600
>>> >>>>> 2015-10-13 22:26:54,734 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 51200
>>> >>>>> 2015-10-13 22:27:45,935 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:28:45,936 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:29:45,937 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:30:45,938 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:31:45,939 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:32:45,940 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:33:45,941 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:34:45,942 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:35:45,943 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:36:45,944 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:37:45,945 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:38:45,946 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:39:45,947 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:40:45,948 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>> 2015-10-13 22:41:45,949 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>> >>>>>
>>> >>>>> And then it finally exits the QuorumCnxManager run loop with the
>>> >>>>> following message:
>>> >>>>> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection broken for id 2
>>> >>>>>
>>> >>>>> How can we ensure the observer does not stay out of service for
>>> >>>>> such a long duration?
>>> >>>>>
>>> >>>>> Attached the full logs
>>> >>>>>
>>> >>>>> Please help
>>> >>>>> Thanks
>>> >>>>>
>>> >>>>> <zookeeper.log.zip>
>>> >>>>
>>> >>>>
>>> >>
>>> >>
>>>
>>>
>

