zookeeper-user mailing list archives

From elastic search <elastic....@gmail.com>
Subject Re: Observers Unusable
Date Thu, 15 Oct 2015 23:06:57 GMT
Thanks Bob and Flavio for your replies.

I am considering opening a JIRA ticket for this issue, unless somebody
thinks this is not an Observer issue.

On Wed, Oct 14, 2015 at 3:20 PM, Bob Sheehan <bsheehan@vmware.com> wrote:

> To be more specific: after leader election, the old observer connections to
> the previous leader are stale (the send-q accumulates), but the TCP timeout
> after 15 minutes terminates these old connections and allows a fresh TCP
> connection to the new leader.
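>
> As a rough illustration of that idea (a hypothetical Java sketch, not
> ZooKeeper's actual code; the class name, host, port, and timeout values
> are placeholders): put an explicit read timeout on the peer socket so a
> dead connection is noticed within seconds instead of waiting for the
> OS-level TCP timeout to expire.
>
>     import java.net.InetSocketAddress;
>     import java.net.Socket;
>     import java.net.SocketTimeoutException;
>
>     public class PeerReadTimeoutSketch {
>         public static void main(String[] args) throws Exception {
>             int connectTimeoutMs = 5000;   // placeholder value
>             int readTimeoutMs = 10000;     // placeholder value
>             try (Socket sock = new Socket()) {
>                 // placeholder host/port for the leader's election endpoint
>                 sock.connect(new InetSocketAddress("leader.example.com", 3888), connectTimeoutMs);
>                 sock.setSoTimeout(readTimeoutMs);
>                 try {
>                     int b = sock.getInputStream().read();
>                     System.out.println("read byte: " + b);
>                 } catch (SocketTimeoutException e) {
>                     // Treat the peer as unreachable and let the caller reconnect,
>                     // rather than waiting for the TCP stack to give up.
>                     System.out.println("peer unresponsive, closing and reconnecting");
>                 }
>             }
>         }
>     }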
>
> Bob
>
> On 10/14/15, 3:17 PM, "Bob Sheehan" <bsheehan@vmware.com> wrote:
>
> >We face a similar issue with remote observers connecting to a 3-node
> >cluster. Using netstat we see Send-Q TCP traffic accumulating. We
> >believe this is a TCP timeout issue.
> >
> >Bob
> >
> >
> >On 10/14/15, 2:54 PM, "elastic search" <elastic.l.k@gmail.com> wrote:
> >
> >>When it outputs the message "There is a connection already for server",
> >>is there really a valid connection, or is it just using the existing map
> >>in memory to figure out whether a connection already exists?
> >>If it is the latter, it should empty this map and initiate a connection
> >>when it is LOOKING for a leader after a connection drop; a rough sketch
> >>of that idea follows the snippet below.
> >>Right now it waits 15 minutes before it does that and re-initiates a
> >>connection.
> >>
> >> if (senderWorkerMap.get(sid) == null) {
> >>     // ... initiate a new connection to server sid ...
> >> } else {
> >>     LOG.debug("There is a connection already for server " + sid);
> >> }
> >>
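> >>As a rough sketch of that suggestion (hypothetical code, not ZooKeeper's
> >>actual implementation; the class and method names are made up, only
> >>senderWorkerMap comes from the snippet above): when the link to a server
> >>drops, evict the cached entry for that sid and reconnect immediately
> >>instead of reusing the half-dead connection.
> >>
> >>    import java.util.concurrent.ConcurrentHashMap;
> >>
> >>    public class ConnectionCacheSketch {
> >>        // Stand-in for senderWorkerMap; AutoCloseable stands in for the per-peer sender worker.
> >>        private final ConcurrentHashMap<Long, AutoCloseable> senderWorkerMap = new ConcurrentHashMap<>();
> >>
> >>        // Called when the link to server `sid` is detected as broken.
> >>        void onConnectionDropped(long sid) throws Exception {
> >>            AutoCloseable stale = senderWorkerMap.remove(sid);
> >>            if (stale != null) {
> >>                stale.close();   // tear down the half-dead worker and its socket
> >>            }
> >>            connectOne(sid);     // then attempt a fresh connection immediately
> >>        }
> >>
> >>        void connectOne(long sid) {
> >>            // placeholder: open a new connection to server sid and register a new worker
> >>        }
> >>    }
> >>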
> >>If the ping to the server is not working, how is it able to send
> >>notifications? The logs say Sending notifications, but the ping is not
> >>working at that time.
> >>
> >>I am not sure how to decrease the 15-minute wait period before the
> >>Observer re-initiates a connection, or alternatively how to make this
> >>work.
> >>
> >>
> >>On Wed, Oct 14, 2015 at 12:50 PM, Flavio Junqueira <fpj@apache.org>
> >>wrote:
> >>
> >>> The snapshot is really small, so it is unlikely to be it, unless the
> >>> connection is really crappy.
> >>>
> >>> -Flavio
> >>>
> >>> > On 14 Oct 2015, at 19:38, elastic search <elastic.l.k@gmail.com>
> >>>wrote:
> >>> >
> >>> > Hi Flavio
> >>> >
> >>> > Thanks for the reply.
> >>> >
> >>> > These are the config parameters
> >>> > tickTime=2000
> >>> > initLimit=10
> >>> > syncLimit=5
> >>> >
> >>> > Data size is
> >>> > -rw-r--r-- 1 xoom xoom  85K Oct  3 10:16 snapshot.170001badb
> >>> > -rw-r--r-- 1 xoom xoom  65M Oct  3 10:16 log.170000da7f
> >>> > -rw-r--r-- 1 xoom xoom  88K Oct 13 23:30 snapshot.170002bfcd
> >>> > -rw-r--r-- 1 xoom xoom  65M Oct 13 23:30 log.170001badd
> >>> > -rw-r--r-- 1 xoom xoom  65M Oct 14 11:36 log.170002bfcf
> >>> >
> >>> > These are all defaults and I have never deleted any log files.
> >>> > Do you still think this could be due to syncLimit or the snapshot
> >>> > being large?
> >>> >
> >>> >
> >>> >
> >>> > On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <fpj@apache.org>
> >>> wrote:
> >>> >
> >>> >>
> >>> >>> On 14 Oct 2015, at 17:22, elastic search <elastic.l.k@gmail.com>
> >>> wrote:
> >>> >>>
> >>> >>> The link between the AWS and the local DataCenter was down for
> >>> >>> around 2 minutes.
> >>> >>> I have been running ping continuously from the DataCenter to the
> >>> >>> AWS and that wasn't responding for a few minutes.
> >>> >>>
> >>> >>> Are you saying that, since we see Send Notification messages in the
> >>> >>> logs, the Observers are able to connect to the ZK, and only the ZK
> >>> >>> server leader is not able to respond back?
> >>> >>
> >>> >> Yeah, the logs indicate that the observer is being able to connect
> >>> >> ("There is a connection already for server..." in the logs) but isn't
> >>> >> able to make progress, which implies that it isn't receiving responses.
> >>> >>
> >>> >>>
> >>> >>> This is what I see from the Server logs:
> >>> >>> 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/10.10.4.46:38161:LearnerHandler@633] - Unexpected exception causing shutdown while sock still open
> >>> >>> 2015-10-09 16:19:28,160 [myid:3] - WARN [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for id 5, my id = 3, error =
> >>> >>>
> >>> >>
> >>> >> These lines belong to different parts of the code path. The first
> >>> >> actually indicates that the observer found the leader, but could not
> >>> >> keep going. For this particular time period, it could be that the
> >>> >> observer is trying to sync up with the leader, but the value of
> >>> >> syncLimit is too low and the leader is abandoning the observer. One
> >>> >> reason for that to happen is the snapshot being large (or your
> >>> >> available bandwidth being low).
> >>> >>
> >>> >> The second line is from the part of the code that does connections
> >>> >> for leader election. That one also says that the connection broke.
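> >>> >>
> >>> >> To make the syncLimit point concrete: with the settings quoted above
> >>> >> (tickTime=2000, syncLimit=5), the leader gives a learner only
> >>> >> syncLimit * tickTime = 10 seconds to stay in sync, and
> >>> >> initLimit * tickTime = 20 seconds for the initial sync. A hypothetical
> >>> >> zoo.cfg sketch that loosens these for a high-latency WAN link (the
> >>> >> values below are only illustrative, not something prescribed here):
> >>> >>
> >>> >> tickTime=2000
> >>> >> # allow up to 60 seconds for a learner to connect and complete the initial sync
> >>> >> initLimit=30
> >>> >> # allow up to 40 seconds of lag before the leader abandons a learner
> >>> >> syncLimit=20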
> >>> >>
> >>> >> -Flavio
> >>> >>
> >>> >>
> >>> >>> On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fpj@apache.org>
> >>> >> wrote:
> >>> >>>
> >>> >>>> Can you tell why the server wasn't responding to the notifications
> >>> >>>> from the observer? The log file is from the observer and it sounds
> >>> >>>> like it is being able to send messages out, but it isn't clear why
> >>> >>>> the server isn't responding.
> >>> >>>>
> >>> >>>> -Flavio
> >>> >>>>
> >>> >>>>> On 14 Oct 2015, at 01:51, elastic search <elastic.l.k@gmail.com>
> >>> >> wrote:
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> Hello Experts
> >>> >>>>>
> >>> >>>>> We have 2 Observers running in AWS connecting to the local ZK
> >>> >>>>> Ensemble in our own DataCenter.
> >>> >>>>>
> >>> >>>>> There have been instances where we see a network drop for a minute
> >>> >>>>> between the networks.
> >>> >>>>> However, the Observers take around 15 minutes to recover even if
> >>> >>>>> the network outage is only for a minute.
> >>> >>>>>
> >>> >>>>> From the logs:
> >>> >>>>> java.net.SocketTimeoutException: Read timed out
> >>> >>>>> 2015-10-13 22:26:03,927 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 400
> >>> >>>>> 2015-10-13 22:26:04,328 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 800
> >>> >>>>> 2015-10-13 22:26:05,129 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 1600
> >>> >>>>> 2015-10-13 22:26:06,730 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 3200
> >>> >>>>> 2015-10-13 22:26:09,931 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 6400
> >>> >>>>> 2015-10-13 22:26:16,332 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 12800
> >>> >>>>> 2015-10-13 22:26:29,133 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 25600
> >>> >>>>> 2015-10-13 22:26:54,734 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 51200
> >>> >>>>> 2015-10-13 22:27:45,935 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:28:45,936 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:29:45,937 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:30:45,938 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:31:45,939 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:32:45,940 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:33:45,941 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:34:45,942 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:35:45,943 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:36:45,944 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:37:45,945 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:38:45,946 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:39:45,947 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:40:45,948 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>> 2015-10-13 22:41:45,949 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> >>>>>
> >>> >>>>> And then it finally exits the QuorumCnxManager run loop with the
> >>> >>>>> following message:
> >>> >>>>> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection broken for id 2
> >>> >>>>>
> >>> >>>>> How can we ensure the observer does not go out of service for such
> >>> >>>>> a long duration?
> >>> >>>>>
> >>> >>>>> Attached are the full logs.
> >>> >>>>>
> >>> >>>>> Please help
> >>> >>>>> Thanks
> >>> >>>>>
> >>> >>>>> <zookeeper.log.zip>
> >>> >>>>
> >>> >>>>
> >>> >>
> >>> >>
> >>>
> >>>
> >
>
>
