zookeeper-user mailing list archives

From: elastic search <elastic....@gmail.com>
Subject: Re: Observers Unusable
Date: Wed, 14 Oct 2015 18:38:18 GMT
Hi Flavio

Thanks for the reply.

These are the config parameters:
tickTime=2000
initLimit=10
syncLimit=5
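
If I have the math right, those defaults give the following windows:

initLimit * tickTime = 10 * 2000 ms = 20 s
  (time a learner gets to connect and finish its initial sync with the leader)
syncLimit * tickTime =  5 * 2000 ms = 10 s
  (time a learner may lag behind before the leader gives up on it)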

Data size is:
-rw-r--r-- 1 xoom xoom  85K Oct  3 10:16 snapshot.170001badb
-rw-r--r-- 1 xoom xoom  65M Oct  3 10:16 log.170000da7f
-rw-r--r-- 1 xoom xoom  88K Oct 13 23:30 snapshot.170002bfcd
-rw-r--r-- 1 xoom xoom  65M Oct 13 23:30 log.170001badd
-rw-r--r-- 1 xoom xoom  65M Oct 14 11:36 log.170002bfcf
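
Just as a back-of-the-envelope on your point about bandwidth (the link
speed is purely my assumption, I have not measured it): moving 65 MB over
a ~10 Mbit/s WAN link would take roughly

65 MB * 8 / 10 Mbit/s  ~=  52 s

which is already well past a 10 s syncLimit window, if anywhere near that
much data has to move during a sync.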

These are all defaults, and I have never deleted any log files.
Do you still think this could be due to syncLimit or the snapshot being
large?
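
In case it helps, this is the kind of change I was thinking of trying for
the WAN observers (the numbers below are only a guess on my part, not
something I have tested):

# zoo.cfg - illustrative values only
tickTime=2000
# allow more time for the observers to connect and do the initial sync
# over the WAN link
initLimit=30
# allow more time before the leader drops a lagging observer
syncLimit=15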



On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <fpj@apache.org> wrote:

>
> > On 14 Oct 2015, at 17:22, elastic search <elastic.l.k@gmail.com> wrote:
> >
> > The link between AWS and the local DataCenter was down for around 2
> > minutes.
> > I have been running ping continuously from the DataCenter to AWS, and it
> > wasn't responding for a few minutes.
> >
> > Are you saying that, since we see Send Notification messages in the logs,
> > the Observers are able to connect to ZK, and it is only that the ZK leader
> > is not able to respond back?
>
> Yeah, the logs indicate that the observer is able to connect ("There is a
> connection already for server..." in the logs) but isn't able to make
> progress, which implies that it isn't receiving responses.
>
> >
> > This is what I see in the server logs:
> > 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/10.10.4.46:38161
> > :LearnerHandler@633] - Unexpected exception causing shutdown while sock
> > still open
> > 2015-10-09 16:19:28,160 [myid:3] - WARN
> > [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for
> > id 5, my id = 3, error =
> >
>
> These lines belong to different parts of the code path. The first actually
> indicates that the observer found the leader, but could not keep going. For
> this particular time period, it could be that the observer is trying to
> sync up with the leader, but the value of syncLimit is too low and the
> leader is abandoning the observer. One reason for that to happen is the
> snapshot being large (or your available bandwidth being low).
>
> The second line is from the part of the code that does connections for
> leader election. That one also says that the connection broke.
>
> -Flavio
>
>
> > On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fpj@apache.org>
> wrote:
> >
> >> Can you tell why the server wasn't responding to the notifications from
> >> the observer? The log file is from the observer, and it sounds like it is
> >> able to send messages out, but it isn't clear why the server isn't
> >> responding.
> >>
> >> -Flavio
> >>
> >>> On 14 Oct 2015, at 01:51, elastic search <elastic.l.k@gmail.com>
> wrote:
> >>>
> >>>
> >>> Hello Experts
> >>>
> >>> We have 2 Observers running in AWS that connect to the local ZK ensemble
> >>> in our own DataCenter.
> >>>
> >>> There have been instances where we see the network drop for a minute
> >>> between the two networks.
> >>> However, the Observers take around 15 minutes to recover even when the
> >>> network outage lasts only a minute.
> >>>
> >>> From the logs:
> >>> java.net.SocketTimeoutException: Read timed out
> >>> 2015-10-13 22:26:03,927 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 400
> >>> 2015-10-13 22:26:04,328 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 800
> >>> 2015-10-13 22:26:05,129 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 1600
> >>> 2015-10-13 22:26:06,730 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 3200
> >>> 2015-10-13 22:26:09,931 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 6400
> >>> 2015-10-13 22:26:16,332 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 12800
> >>> 2015-10-13 22:26:29,133 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 25600
> >>> 2015-10-13 22:26:54,734 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 51200
> >>> 2015-10-13 22:27:45,935 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:28:45,936 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:29:45,937 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:30:45,938 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:31:45,939 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:32:45,940 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:33:45,941 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:34:45,942 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:35:45,943 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:36:45,944 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:37:45,945 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:38:45,946 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:39:45,947 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:40:45,948 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>> 2015-10-13 22:41:45,949 [myid:4] - INFO
> >> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> >> Notification time out: 60000
> >>>
> >>> And then it finally exits the QuorumCnxManager run loop with the
> >>> following message:
> >>> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection
> >>> broken for id 2
> >>>
> >>> How can we ensure the observer does not go out of service for such a
> >>> long duration?
> >>>
> >>> Attached the full logs
> >>>
> >>> Please help
> >>> Thanks
> >>>
> >>> <zookeeper.log.zip>
> >>
> >>
>
>
