zookeeper-user mailing list archives

From Chris <c.turks...@gmail.com>
Subject Re: Leader election failing
Date Mon, 03 Sep 2018 15:04:13 GMT
I haven't noticed it in 3.4 back when we used it, but I can do a test to
confirm it. I will let you know in approximately one week.
Regards
Chris

On 3 September 2018 4:56:00 pm Andor Molnar <andor@cloudera.com.INVALID> wrote:

> Thanks for testing, Chris.
>
> So, if I understand you correctly, you're running the latest version from
> branch-3.5. Could we say that this is a 3.5-only problem?
> Have you ever tested the same cluster with 3.4?
>
> Regards,
> Andor
>
>
>
> On Tue, Aug 21, 2018 at 11:29 AM, Cee Tee <c.turksema@gmail.com> wrote:
>
>> I've tested the patch and let it run for 6 days. It did not help; the
>> result is still the same (the remaining ZKs form islands based on the
>> datacenter they are in).
>>
>> I have mitigated it by doing a daily rolling restart.
>>
>> Regards,
>> Chris
>>
>> On Mon, Aug 13, 2018 at 2:06 PM Andor Molnar <andor@cloudera.com.invalid>
>> wrote:
>>
>>> Hi Chris,
>>>
>>> Would you mind testing the following patch on your test clusters?
>>> I'm not entirely sure, but the issue might be related.
>>>
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2930
>>>
>>> Regards,
>>> Andor
>>>
>>>
>>>
>>> On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier <camille@apache.org>
>>> wrote:
>>>
>>>> If you have the time and inclination, next time you see this problem
>>>> in your test clusters, get stack traces and any other diagnostics
>>>> possible before restarting. I'm not an expert at network debugging,
>>>> but if you have someone who is, you might want them to take a look at
>>>> the connections and settings of any switches/firewalls/etc. involved,
>>>> and see if there are any unusual configurations or evidence of other
>>>> long-lived connections failing (even if their services handle the
>>>> failures more gracefully). Send us the stack traces also; it would be
>>>> interesting to take a look.
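One way to capture the kind of diagnostics requested above, before restarting
anything, is to take a thread dump of each ZooKeeper process (for example with
the JDK's jstack tool) and to snapshot the four-letter-word output ("stat",
"mntr", "cons") from every server. The sketch below is only an illustration,
not something from this thread: the hostnames, client port and file names are
placeholders, and on 3.5 the commands may first need to be allowed via the
4lw.commands.whitelist property.

    import java.io.ByteArrayOutputStream;
    import java.io.FileWriter;
    import java.io.InputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    // Illustrative sketch: collect "stat", "mntr" and "cons" output from each
    // ensemble member over the plain client port and save it to one file per
    // server/command, so it can be attached to the mailing list thread.
    // Hostnames and the client port are placeholders for the real ensemble.
    public class CollectZkDiagnostics {
        private static final String[] HOSTS = {"zk1", "zk2", "zk3", "zk4", "zk5", "zk6"};
        private static final String[] COMMANDS = {"stat", "mntr", "cons"};
        private static final int CLIENT_PORT = 2181;

        public static void main(String[] args) throws Exception {
            for (String host : HOSTS) {
                for (String cmd : COMMANDS) {
                    try (Socket sock = new Socket()) {
                        sock.connect(new InetSocketAddress(host, CLIENT_PORT), 5000);
                        sock.getOutputStream().write(cmd.getBytes(StandardCharsets.US_ASCII));
                        sock.getOutputStream().flush();
                        sock.shutdownOutput();               // server answers, then closes
                        String reply = readAll(sock.getInputStream());
                        try (FileWriter out = new FileWriter(host + "-" + cmd + ".txt")) {
                            out.write(reply);
                        }
                    } catch (Exception e) {
                        System.err.println(host + "/" + cmd + " failed: " + e);
                    }
                }
            }
        }

        private static String readAll(InputStream in) throws Exception {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) > 0) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

The "stat"/"srvr" output includes the Mode line (leader/follower/observer), so
the same check can also serve as a quick sanity test between steps of the daily
rolling restart mentioned earlier in the thread.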
>>>>
>>>> C
>>>>
>>>>
>>>> On Wed, Aug 8, 2018, 11:09 AM Chris <c.turksema@gmail.com> wrote:
>>>>
>>>>> Running 3.5.5
>>>>>
>>>>> I managed to recreate it on the acceptance and test clusters today,
>>>>> failing on shutdown of the leader. Both had been running for over a
>>>>> week. After restarting all ZooKeepers it runs fine no matter how many
>>>>> leader shutdowns I throw at it.
>>>>>
>>>>> On 8 August 2018 5:05:34 pm Andor Molnar <andor@cloudera.com.INVALID>
>>>>> wrote:
>>>>>
>>>>>> Some kind of a network split?
>>>>>>
>>>>>> It looks like 1-2 and 3-4 were able to communicate with each other,
>>>>>> but the connection timed out between the two partitions. When 5 came
>>>>>> back online it started with supporters of (1,2), and later 3 and 4
>>>>>> also joined.
>>>>>>
>>>>>> There was no such issue the day after.
>>>>>>
>>>>>> Which version of ZooKeeper is this? 3.5.something?
>>>>>>
>>>>>> Regards,
>>>>>> Andor
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turksema@gmail.com> wrote:
>>>>>>
>>>>>>> Actually, I have similar issues on my test and acceptance clusters,
>>>>>>> where leader election fails if the cluster has been running for a
>>>>>>> couple of days. If you stop/start the ZooKeepers once, they will
>>>>>>> work fine on further disruptions that day. Not sure yet what the
>>>>>>> threshold is.
>>>>>>>
>>>>>>>
>>>>>>> On 8 August 2018 4:32:56 pm Camille Fournier <camille@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hard to say. It looks like about 15 minutes after your first
>>>>>>>> incident where 5 goes down and then comes back up, servers 1 and 2
>>>>>>>> get socket errors on their connections with 3, 4, and 6. It's
>>>>>>>> possible that if you had waited those 15 minutes, once those
>>>>>>>> errors cleared the quorum would've formed with the other servers.
>>>>>>>> But as for why those errors occurred in the first place, it's not
>>>>>>>> clear. Could be a network glitch, or an obscure bug in the
>>>>>>>> connection logic. Has anyone else ever seen this?
>>>>>>>> If you see it again, getting a stack trace of the servers when
>>>>>>>> they can't form quorum might be helpful.
>>>>>>>>
>>>>>>>> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turksema@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
>>>>>>>>> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
>>>>>>>>> Yesterday one of the participants (id 5, which by chance was the
>>>>>>>>> leader) was rebooted. Although all other servers were online and
>>>>>>>>> not suffering from networking issues, the leader election failed
>>>>>>>>> and the cluster remained "looking" until the old leader came back
>>>>>>>>> online, after which it was promptly elected as leader again.
>>>>>>>>>
>>>>>>>>> Today we tried the same exercise on the exact same servers, 5 was
>>>>>>>>> still leader and was rebooted, and leader election worked fine
>>>>>>>>> with 4 as the new leader.
>>>>>>>>>
>>>>>>>>> I have included the logs. From the logs I see that yesterday 1,2
>>>>>>>>> never received new leader proposals from 3,4 and vice versa.
>>>>>>>>> Today all proposals came through. This is not the first time
>>>>>>>>> we've seen this type of behavior, where some zookeepers can't
>>>>>>>>> seem to find each other after the leader goes down.
>>>>>>>>> All servers use dynamic configuration and have the same config
>>>>>>>>> node.
>>>>>>>>>
>>>>>>>>> How could this be explained? These servers also host a replicated
>>>>>>>>> database cluster and have no history of db replication issues.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>

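Since the original report notes that all servers use dynamic configuration and
share the same config node, one additional check when the ensemble splits into
islands is to read /zookeeper/config from every server individually and compare
what each member returns. Below is a rough, illustrative sketch using the
standard ZooKeeper Java client; the host list is a placeholder, and it assumes
the config znode is readable by the connecting client.

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Illustrative sketch: connect to each server separately (single-host
    // connect string) and print the dynamic configuration stored under
    // /zookeeper/config, so the config seen by every member can be compared.
    // Hostnames and ports are placeholders for the real ensemble.
    public class CompareZkConfig {
        private static final String[] SERVERS = {
            "zk1:2181", "zk2:2181", "zk3:2181", "zk4:2181", "zk5:2181", "zk6:2181"
        };

        public static void main(String[] args) throws Exception {
            for (String server : SERVERS) {
                CountDownLatch connected = new CountDownLatch(1);
                ZooKeeper zk = new ZooKeeper(server, 10000, event -> {
                    if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                });
                try {
                    if (!connected.await(10, TimeUnit.SECONDS)) {
                        throw new IllegalStateException("connection timed out");
                    }
                    Stat stat = new Stat();
                    byte[] data = zk.getData("/zookeeper/config", false, stat);
                    System.out.println(server + ":");
                    System.out.println(new String(data, StandardCharsets.UTF_8));
                } catch (Exception e) {
                    System.err.println(server + " failed: " + e);
                } finally {
                    zk.close();
                }
            }
        }
    }

If one server returns a different configuration, or cannot be reached at all,
that member is the obvious place to start looking when the remaining ZooKeepers
form islands per datacenter.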


