zookeeper-user mailing list archives

From: Chris <c.turks...@gmail.com>
Subject: Re: Leader election failing
Date: Tue, 11 Sep 2018 19:42:37 GMT
What action should I take to get the most useful logs in this case?

Set the log level to DEBUG and send kill -3 when it's failing?
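
Something like this on each node, I'm thinking (a minimal sketch; the paths
assume a stock install and that jstack is on the PATH, so adjust as needed):

    # conf/log4j.properties: raise the root logger to DEBUG, e.g.
    #   log4j.rootLogger=DEBUG, CONSOLE, ROLLINGFILE

    # While the election is stuck, grab a thread dump from the server JVM.
    ZK_PID=$(cat /var/lib/zookeeper/zookeeper_server.pid)
    kill -3 "$ZK_PID"    # SIGQUIT: the JVM prints the dump to zookeeper.out
    jstack "$ZK_PID" > "/tmp/zk-threads-$(hostname)-$(date +%s).txt"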


On 11 September 2018 9:17:45 pm Andor Molnár <andor@apache.org> wrote:

> Erm.
>
> Thanks for carrying out these tests, Chris.
>
> Have you by any chance - as Camille suggested - collected debug logs
> from these tests?
>
>
> Andor
>
>
>
> On 09/11/2018 11:08 AM, Cee Tee wrote:
>> I concluded a test with a 3.4.13 cluster; it shows the same behaviour.
>>
>> On Mon, Sep 3, 2018 at 4:56 PM Andor Molnar <andor@cloudera.com.invalid>
>> wrote:
>>
>>> Thanks for testing, Chris.
>>>
>>> So, if I understand you correctly, you're running the latest version from
>>> branch-3.5. Could we say that this is a 3.5-only problem?
>>> Have you ever tested the same cluster with 3.4?
>>>
>>> Regards,
>>> Andor
>>>
>>>
>>>
>>> On Tue, Aug 21, 2018 at 11:29 AM, Cee Tee <c.turksema@gmail.com> wrote:
>>>
>>>> I've tested the patch and let it run for 6 days. It did not help; the
>>>> result is still the same (the remaining ZKs form islands based on the
>>>> datacenter they are in).
>>>>
>>>> I have mitigated it by doing a daily rolling restart.
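>>>>
>>>> (Roughly: a staggered cron entry per node, so that only one server
>>>> restarts at a time; paths and times here are illustrative:
>>>>
>>>>     # node 1 at 03:00, node 2 at 03:10, and so on
>>>>     0 3 * * * /opt/zookeeper/bin/zkServer.sh restart
>>>> )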
>>>>
>>>> Regards,
>>>> Chris
>>>>
>>>> On Mon, Aug 13, 2018 at 2:06 PM Andor Molnar <andor@cloudera.com.invalid>
>>>> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> Would you mind testing the following patch on your test clusters?
>>>>> I'm not entirely sure, but the issue might be related.
>>>>>
>>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2930
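>>>>>
>>>>> (Roughly, assuming the Jira's patch attachment and an ant-based
>>>>> branch-3.5 checkout; the file name is illustrative:
>>>>>
>>>>>     git apply ZOOKEEPER-2930.patch
>>>>>     ant jar
>>>>> )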
>>>>>
>>>>> Regards,
>>>>> Andor
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier <camille@apache.org>
>>>>> wrote:
>>>>>
>>>>>> If you have the time and inclination, next time you see this problem in
>>>>>> your test clusters, get stack traces and any other diagnostics possible
>>>>>> before restarting. I'm not an expert at network debugging, but if you
>>>>>> have someone who is, you might want them to take a look at the
>>>>>> connections and settings of any switches/firewalls/etc. involved, and
>>>>>> see if there are any unusual configurations or evidence of other
>>>>>> long-lived connections failing (even if their services handle the
>>>>>> failures more gracefully). Send us the stack traces as well; it would
>>>>>> be interesting to take a look.
>>>>>>
>>>>>> C
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 8, 2018, 11:09 AM Chris <c.turksema@gmail.com> wrote:
>>>>>>
>>>>>>> Running 3.5.5
>>>>>>>
>>>>>>> I managed to recreate it on the acceptance and test clusters today,
>>>>>>> failing on shutdown of the leader. Both had been running for over a
>>>>>>> week. After restarting all ZooKeepers, it runs fine no matter how many
>>>>>>> leader shutdowns I throw at it.
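>>>>>>>
>>>>>>> (The srvr four-letter word shows which server is the leader before
>>>>>>> each shutdown; host and port here are placeholders:
>>>>>>>
>>>>>>>     echo srvr | nc zk-host 2181 | grep Mode
>>>>>>> )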
>>>>>>>
>>>>>>> On 8 August 2018 5:05:34 pm Andor Molnar <andor@cloudera.com.INVALID>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Some kind of a network split?
>>>>>>>>
>>>>>>>> It looks like 1-2 and 3-4 were able to communicate with each other,
>>>>>>>> but connections timed out between the two splits. When 5 came back
>>>>>>>> online, it started with supporters (1,2), and later 3 and 4 also
>>>>>>>> joined.
>>>>>>>>
>>>>>>>> There was no such issue the day after.
>>>>>>>>
>>>>>>>> Which version of ZooKeeper is this? 3.5.something?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Andor
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turksema@gmail.com> wrote:
>>>>>>>>> Actually, I have similar issues on my test and acceptance clusters,
>>>>>>>>> where leader election fails if the cluster has been running for a
>>>>>>>>> couple of days. If you stop/start the ZooKeepers once, they will
>>>>>>>>> work fine on further disruptions that day. Not sure yet what the
>>>>>>>>> threshold is.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8 August 2018 4:32:56 pm Camille Fournier <camille@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>> Hard to say. It looks like about 15 minutes after your first
>>>>>>>>>> incident, where 5 goes down and then comes back up, servers 1 and 2
>>>>>>>>>> get socket errors on their connections with 3, 4, and 6. It's
>>>>>>>>>> possible that if you had waited those 15 minutes, the quorum
>>>>>>>>>> would've formed with the other servers once those errors cleared.
>>>>>>>>>> But as for why there were those errors in the first place, it's not
>>>>>>>>>> clear. Could be a network glitch, or an obscure bug in the
>>>>>>>>>> connection logic. Has anyone else ever seen this?
>>>>>>>>>> If you see it again, getting a stack trace of the servers when they
>>>>>>>>>> can't form quorum might be helpful.
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turksema@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> I have a cluster of 5 participants (ids 1-5) and 1 observer (id
>>>>>>>>>>> 6). 1, 2, and 5 are in datacenter A; 3, 4, and 6 are in datacenter
>>>>>>>>>>> B.
>>>>>>>>>>> Yesterday one of the participants (id 5, which by chance was the
>>>>>>>>>>> leader) was rebooted. Although all other servers were online and
>>>>>>>>>>> not suffering from networking issues, the leader election failed
>>>>>>>>>>> and the cluster remained "looking" until the old leader came back
>>>>>>>>>>> online, after which it was promptly elected leader again.
>>>>>>>>>>>
>>>>>>>>>>> Today we tried the same exercise on the exact same servers; 5 was
>>>>>>>>>>> still the leader and was rebooted, and leader election worked fine
>>>>>>>>>>> with 4 as the new leader.
>>>>>>>>>>>
>>>>>>>>>>> I have included the logs. From the logs I see that yesterday 1 and
>>>>>>>>>>> 2 never received new leader proposals from 3 and 4, and vice
>>>>>>>>>>> versa. Today all proposals came through. This is not the first
>>>>>>>>>>> time we've seen this type of behavior, where some ZooKeepers can't
>>>>>>>>>>> seem to find each other after the leader goes down.
>>>>>>>>>>> All servers use dynamic configuration and have the same config
>>>>>>>>>>> node.
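>>>>>>>>>>>
>>>>>>>>>>> (For reference, that config node holds entries of roughly this
>>>>>>>>>>> shape, per the 3.5 reconfig format; hostnames and ports here are
>>>>>>>>>>> placeholders:
>>>>>>>>>>>
>>>>>>>>>>>     server.1=zk1.dc-a.example:2888:3888:participant;2181
>>>>>>>>>>>     server.2=zk2.dc-a.example:2888:3888:participant;2181
>>>>>>>>>>>     server.3=zk3.dc-b.example:2888:3888:participant;2181
>>>>>>>>>>>     server.4=zk4.dc-b.example:2888:3888:participant;2181
>>>>>>>>>>>     server.5=zk5.dc-a.example:2888:3888:participant;2181
>>>>>>>>>>>     server.6=zk6.dc-b.example:2888:3888:observer;2181
>>>>>>>>>>> )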
>>>>>>>>>>>
>>>>>>>>>>> How could this be explained? These servers also host a replicated
>>>>>>>>>>> database cluster and have no history of db replication issues.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Chris



