zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karol Dudzinski <karoldudzin...@gmail.com>
Subject Re: Leader election duration
Date Wed, 29 Apr 2015 18:05:12 GMT
The snapshots are a bit larger than I'd like (~170MB) but the bit that takes time is actually
before that.

On a successful election, on the new leader I see the following:
- LEADER ELECTION TOOK: 12345
- Synchronizing with Follower...
- Sending DIFF or SNAP

That last bit will take a long time if the snapshot is large, agreed.  However, for me it's
arriving at the LEADER ELECTION TOOK bit that takes a long time.  And the bit between that
and the Synchronizing often times out with the messages I described in my previous email.

Karol

> On 29 Apr 2015, at 18:46, Flavio Junqueira <fpjunqueira@yahoo.com.INVALID> wrote:
> 
> Check the size of your snapshot files. If during election you need to transfer snapshots
and they are large, then recovery is going to take longer. We try to send a diff of the txns
when possible, and sending a diff is much faster compared to send a full snapshot. Sending
a diff isn't always possible, depending on how far behind followers compared to the leader,
so the difference might be due to snapshots and diffs.
> -Flavio  
> 
> 
>     On Wednesday, April 29, 2015 6:32 PM, Camille Fournier <camille@apache.org>
wrote:
> 
> 
> 
> Don't suppose you could share some snippets of the log messages you're
> seeing a lot?
> 
> On Wed, Apr 29, 2015 at 1:29 PM, Karol Dudzinski <karoldudzinski@gmail.com>
> wrote:
> 
>> Did a bit more investigation today though I didn't try the prod snapshot
>> in a test cluster.  I did however find a couple of other strange things
>> which I'd initially missed.
>> 
>> When the leader of this particular cluster goes for a reboot, I see many
>> notification printouts from FastLeaderElection.  On some occasions I've
>> seen nearly 50.  In other clusters I usually see 5 or thereabouts.  The
>> number of these notifications seems to correlate with how long election
>> took.  What could cause it to require exchanging so many messages?
>> 
>> This cluster also seems to often timeout while syncing after selecting the
>> leader.  On the new leader I see the expected number of "Follower sid: ..."
>> messages but it doesn't always get to the "Synchronizing with Follower"
>> part.  On the leader I then get a "Unexpected exception causing shutdown
>> while sock still open" message followed by a GOODBYE for one of the
>> followers.  At the same time on that follower I see a "Exception when
>> following the leader" caused by a read timeout in
>> Learner.registerWithLeader.  Does anything happen prior to syncing that
>> could take a long time?
>> 
>> Thanks,
>> Karol
>> 
>>>> On 28 Apr 2015, at 20:22, Karol Dudzinski <karoldudzinski@gmail.com>
>>> wrote:
>>> 
>>> Well these are prod clusters so my ability to experiment is rather
>> limited.  I can take a copy of the snapshot and try both 3 node and 5 in a
>> test cluster.
>>> 
>>> One thing I forgot to mention is that in most clusters the number of
>> election notification log lines I see is typically, give or take, the same
>> as the number of participants.  In this cluster however, it's typically 2
>> or 3 times as many notifications as the number of participants.
>>> 
>>> My gut feeling is it's more likely to be due to load as the 5 node
>> cluster is much busier and the election time has been increasing over time
>> (as has load).  I have no idea exactly what load though, whether it's
>> number of clients, frequency of transactions, total data size, etc.  I
>> don't understand why though but that may just be my limited knowledge of
>> the election protocol.
>>> 
>>> Karol
>>> 
>>>> On 28 Apr 2015, at 19:54, Camille Fournier <camille@apache.org> wrote:
>>>> 
>>>> Just out of curiosity, if you start the 5 node cluster up with only 3 of
>>>> the nodes to begin with (like, config 5, but only bring up 3 processes),
>>>> does it speed up the leader election or is it still slow?
>>>> 
>>>> C
>>>> 
>>>> On Tue, Apr 28, 2015 at 1:41 PM, Karol Dudzinski <
>> karoldudzinski@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> We're seeing some rather strange leader election in one of our
>> clusters.
>>>>> The duration reported by the "FOLLOWING - LEADER ELECTION TOOK" log
>> line
>>>>> (and equivalent for the leader) seems to vary hugely.  During one
>> rolling
>>>>> reboot, I saw the number reported as small as 39ms and as large as 57
>>>>> seconds (difference in units is not a typo).  The average is just
>> about 10
>>>>> seconds and std dev also about 10 seconds.  So the time taken is not
>> only
>>>>> quite large, it's also very variable.
>>>>> 
>>>>> We have other clusters but the average election time in those is in the
>>>>> hundreds of millis with std dev in a similar ballpark.  I guess one
>>>>> difference is the "slow" cluster is 5 participants while the others
>> are 3,
>>>>> which may be a factor but I wouldn't expect it to make two orders of
>>>>> magnitude difference!
>>>>> 
>>>>> So my question is, what factors contribute to the election time
>> reported
>>>>> by these log lines? And what can we do to speed this up?
>>>>> 
>>>>> As far as I understand from logs and a quick browse through the code
>> that
>>>>> time is the time to select a leader.  Syncing up to the leader happens
>>>>> after that.  The syncing part I can understand will vary depending on
>> load
>>>>> but I don't see why selecting the leader would.
>>>>> 
>>>>> Thanks,
>>>>> Karol
> 
> 
> 

Mime
View raw message