zookeeper-user mailing list archives

From Flavio Junqueira <...@apache.org>
Subject Re: Observers Unusable
Date Wed, 14 Oct 2015 19:50:31 GMT
The snapshot is really small, so it is unlikely to be it, unless the connection is really crappy.
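
As a back-of-envelope check using the numbers below: syncLimit=5 with
tickTime=2000 gives the leader a 10-second window (5 x 2000 ms) to keep a
learner in sync, and an ~88 KB snapshot transfers in well under a second even
over a 1 Mbit/s link. So the snapshot only becomes the bottleneck if the link
is effectively unusable.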

-Flavio

> On 14 Oct 2015, at 19:38, elastic search <elastic.l.k@gmail.com> wrote:
> 
> Hi Flavio
> 
> Thanks for the reply.
> 
> These are the config parameters
> tickTime=2000
> initLimit=10
> syncLimit=5
> 
> Data size is
> -rw-r--r-- 1 xoom xoom  85K Oct  3 10:16 snapshot.170001badb
> -rw-r--r-- 1 xoom xoom  65M Oct  3 10:16 log.170000da7f
> -rw-r--r-- 1 xoom xoom  88K Oct 13 23:30 snapshot.170002bfcd
> -rw-r--r-- 1 xoom xoom  65M Oct 13 23:30 log.170001badd
> -rw-r--r-- 1 xoom xoom  65M Oct 14 11:36 log.170002bfcf
> 
> These are all defaults and I have never deleted any log files.
> Do you still think this could be due to syncLimit or the snapshot being
> large?
> 
> 
> 
> On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <fpj@apache.org> wrote:
> 
>> 
>>> On 14 Oct 2015, at 17:22, elastic search <elastic.l.k@gmail.com> wrote:
>>> 
>>> The link between AWS and the local DataCenter was down for around 2
>>> minutes. I have been running ping continuously from the DataCenter to
>>> AWS, and it wasn't responding for a few minutes.
>>> 
>>> Are you saying that since we see Send Notification messages in the logs,
>>> the Observers are able to connect to ZK, and it is only that the ZK
>>> leader is not able to respond back?
>> 
>> Yeah, the logs indicate that the observer is able to connect ("There is a
>> connection already for server..." in the logs) but isn't able to make
>> progress, which implies that it isn't receiving responses.
>> 
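>> (A quick sanity check during one of these episodes, if you can get a shell
>> on the observer host: "echo stat | nc <leader-host> 2181" will at least
>> tell you whether the leader's client port answers from that side of the
>> link; <leader-host> is just a placeholder for whichever server is currently
>> leading.)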
>>> 
>>> This is what I see from the server logs:
>>> 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/10.10.4.46:38161:LearnerHandler@633] - Unexpected exception causing shutdown while sock still open
>>> 2015-10-09 16:19:28,160 [myid:3] - WARN [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for id 5, my id = 3, error =
>>> 
>> 
>> These lines belong to different parts of the code path. The first actually
>> indicates that the observer found the leader, but could not keep going. For
>> this particular time period, it could be that the observer is trying to
>> sync up with the leader, but the value of syncLimit is too low and the
>> leader is abandoning the observer. One reason for that to happen is the
>> snapshot being large (or your available bandwidth being low).
>> 
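>> If syncLimit does turn out to be what you're hitting, one illustrative (not
>> prescriptive) adjustment would be to widen the windows in zoo.cfg, for
>> example:
>> 
>>   tickTime=2000
>>   # 20 ticks = 40 s for the initial sync with the leader (was 20 s)
>>   initLimit=20
>>   # 10 ticks = 20 s to stay in sync (was 10 s)
>>   syncLimit=10
>> 
>> The exact values are just for illustration; size them to how long the
>> snapshot plus backlog actually takes to move across your link.
>> 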
>> The second line is from the part of the code that does connections for
>> leader election. That one also says that the connection broke.
>> 
>> -Flavio
>> 
>> 
>>> On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fpj@apache.org>
>> wrote:
>>> 
>>>> Can you tell why the server wasn't responding to the notifications from
>>>> the observer? The log file is from the observer, and it sounds like it
>>>> is able to send messages out, but it isn't clear why the server isn't
>>>> responding.
>>>> 
>>>> -Flavio
>>>> 
>>>>> On 14 Oct 2015, at 01:51, elastic search <elastic.l.k@gmail.com>
>> wrote:
>>>>> 
>>>>> 
>>>>> Hello Experts
>>>>> 
>>>>> We have 2 Observers running in AWS, connecting to a local ZK ensemble
>>>>> in our own DataCenter.
>>>>> 
>>>>> There have been instances where we see the network drop for a minute
>>>>> between the two networks. However, the Observers take around 15 minutes
>>>>> to recover even if the network outage lasts only a minute.
>>>>> 
>>>>> From the logs:
>>>>> java.net.SocketTimeoutException: Read timed out
>>>>> 2015-10-13 22:26:03,927 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 400
>>>>> 2015-10-13 22:26:04,328 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 800
>>>>> 2015-10-13 22:26:05,129 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 1600
>>>>> 2015-10-13 22:26:06,730 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 3200
>>>>> 2015-10-13 22:26:09,931 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 6400
>>>>> 2015-10-13 22:26:16,332 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 12800
>>>>> 2015-10-13 22:26:29,133 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 25600
>>>>> 2015-10-13 22:26:54,734 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 51200
>>>>> 2015-10-13 22:27:45,935 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:28:45,936 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:29:45,937 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:30:45,938 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:31:45,939 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:32:45,940 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:33:45,941 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:34:45,942 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:35:45,943 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:36:45,944 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:37:45,945 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:38:45,946 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:39:45,947 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:40:45,948 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>>>>> 2015-10-13 22:41:45,949 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
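>>>>> 
>>>>> (My reading of the doubling above is that it is the election-notification
>>>>> backoff hitting its 60 s cap. Very roughly, as a sketch of the pattern and
>>>>> not the actual FastLeaderElection code, with gotNotification() and
>>>>> waitForNotification() as made-up placeholder helpers:
>>>>> 
>>>>>   int notTimeout = 400;                            // first wait seen in the log
>>>>>   while (!gotNotification()) {                     // hypothetical "any answer yet?" check
>>>>>       waitForNotification(notTimeout);             // hypothetical blocking wait
>>>>>       notTimeout = Math.min(notTimeout * 2, 60000); // double, capped at 60 s
>>>>>   }
>>>>> 
>>>>> so once the link is back, the next retry can still be up to a minute away.)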
>>>>> 
>>>>> And then it finally exits the QuorumCnxManager run loop with the
>>>>> following message:
>>>>> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection broken for id 2
>>>>> 
>>>>> How can we ensure the observer does not stay out of service for such a
>>>>> long duration?
>>>>> 
>>>>> Attached are the full logs.
>>>>> 
>>>>> Please help
>>>>> Thanks
>>>>> 
>>>>> <zookeeper.log.zip>
>>>> 
>>>> 
>> 
>> 

