zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From elastic search <elastic....@gmail.com>
Subject Re: Observers Unusable
Date Wed, 14 Oct 2015 16:22:01 GMT
The link between the AWS and the local DataCenter was down for around 2
minutes.
I have been running ping continuously from the DataCenter to the AWS and
that wasn't responding for a few minutes.

Are you saying since we see Send Notification messages in the logs , that
would mean the Observers are able to connect to the ZK .
Only that the ZK server leader is not able to respond back ?

This is what i see from the Server logs
2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/10.10.4.46:38161
:LearnerHandler@633] - Unexpected exception causing shutdown while sock
still open
2015-10-09 16:19:28,160 [myid:3] - WARN
[RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for id
5, my id = 3, error =

On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fpj@apache.org> wrote:

> Can you tell why the server wasn't responding to the notifications from
> the observer? The log file is from the observer and it sounds like it is
> being able to send messages out, but it isn't clear why the server isn't
> responding.
>
> -Flavio
>
> > On 14 Oct 2015, at 01:51, elastic search <elastic.l.k@gmail.com> wrote:
> >
> >
> > Hello Experts
> >
> > We have 2 Observers running in AWS connecting over to local ZK Ensemble
> in our own DataCenter.
> >
> > There have been instances where we see network drop for a minute between
> the networks.
> > However the Observers take around 15 minutes to recover even if the
> network outage is for a minute.
> >
> > From the logs
> > java.net.SocketTimeoutException: Read timed out
> > 2015-10-13 22:26:03,927 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 400
> > 2015-10-13 22:26:04,328 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 800
> > 2015-10-13 22:26:05,129 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 1600
> > 2015-10-13 22:26:06,730 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 3200
> > 2015-10-13 22:26:09,931 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 6400
> > 2015-10-13 22:26:16,332 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 12800
> > 2015-10-13 22:26:29,133 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 25600
> > 2015-10-13 22:26:54,734 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 51200
> > 2015-10-13 22:27:45,935 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:28:45,936 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:29:45,937 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:30:45,938 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:31:45,939 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:32:45,940 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:33:45,941 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:34:45,942 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:35:45,943 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:36:45,944 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:37:45,945 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:38:45,946 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:39:45,947 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:40:45,948 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> > 2015-10-13 22:41:45,949 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 60000
> >
> > And then finally exits the QuorumCnxManager run loop with the following
> message
> > WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection
> broken for id 2
> >
> > How can we ensure the observer does not go out for service such a long
> duration ?
> >
> > Attached the full logs
> >
> > Please help
> > Thanks
> >
> > <zookeeper.log.zip>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message