curator-user mailing list archives

From Jeremy Stribling <>
Subject Re: adding a "network timeout" to curator?
Date Thu, 27 Feb 2014 06:24:39 GMT
Actually, in our case it's mostly about trying to detect a true network 
or process failure quickly, without having to wait the entire, long 
session timeout that's needed because of slow disks.

Now that I think about it a bit more though, I don't think we can get 
what we want entirely on the client side.  Really, what we want is fast 
leader failover among the clients when there is a real network/process 
failure, without risking false session expirations due to slow disks.  
However, in my proposal the server still won't expire the client's 
session until the full session timeout elapses, since it doesn't know 
about the client's 'ruok' protocol.  What I'm proposing would only allow 
a client to reconnect to a different server quickly, it wouldn't affect 
other clients' view of the session.

Hmm, maybe back to the drawing board.  Thanks for listening, anyway.


On 02/26/2014 10:16 PM, Jordan Zimmerman wrote:
> I see. So, this is a slow network. You’d like a heuristic that puts 
> Curator into SUSPENDED mode when the network performance drops. Sounds 
> interesting to me.
> -JZ
> ------------------------------------------------------------------------
> From: Jeremy Stribling <>
> Reply: Jeremy Stribling <>
> Date: February 27, 2014 at 11:15:46 AM
> To: Jordan Zimmerman <>, <>
> Subject: Re: adding a "network timeout" to curator?
>> Please correct me if I'm wrong, but I thought Curator went into 
>> SUSPENDED mode when it gets a Disconnected state event from its ZK 
>> client.  That is not necessarily the same as a network issue, because 
>> that ZK keepalive could be stuck in the ZK server processing queue, 
>> blocked on a slow disk.  What I'm proposing would be a true, 
>> network-only timeout that could be used to declare a client 
>> disconnected quickly if there's a network issue, without having to 
>> reduce the ZK session timeout so low that a slow disk would cause 
>> false negatives. Does that make sense?
>> Jeremy
>> On 02/26/2014 09:25 PM, Jordan Zimmerman wrote:
>>> Curator should already go into SUSPENDED when there is a connection 
>>> issue, right? How would this be different?
>>> -JZ
>>> ------------------------------------------------------------------------
>>> From: Jeremy Stribling <>
>>> Reply: <>
>>> Date: February 26, 2014 at 7:56:26 AM
>>> To: <>
>>> Subject: adding a "network timeout" to curator?
>>>> Hi all,
>>>> I started a thread on the ZK list a while back about timeouts in ZK.
>>>> You can find it in the archives here:
>>>> The basic idea is that when ZK is running on a node with slow disks
>>>> (e.g., in a VM), you might want to set your session timeout to a long
>>>> value (e.g., 30 seconds or 60 seconds), but still detect network
>>>> timeouts quickly. On that thread, Michi proposed using 'ruok' commands
>>>> from the client to test network connectivity, along with the normal
>>>> client pings happening in the background to detect server slowness.
>>>> I was wondering if this would make sense to provide as part of the
>>>> Curator Framework or Client. There could be some background thread
>>>> sending 'ruok' commands to whatever server the client is connected to,
>>>> and going into SUSPENDED (or LOST?) mode when it hits a timeout or gets
>>>> a failure back. We might be able to implement something like that here
>>>> and contribute it back, if it sounds interesting to other people and we
>>>> can agree on a design. Any thoughts?
>>>> Jeremy
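A minimal Java sketch of the background 'ruok' probe proposed in the original message. It makes several assumptions: it opens a plain TCP connection to the ZooKeeper client port, it expects the server to answer `ruok` with `imok` (newer ZooKeeper releases may require `ruok` to be listed in `4lw.commands.whitelist`), and it uses a hypothetical `onFailure` callback at the point where a real integration would move Curator's connection state to SUSPENDED. `RuokProbe`, `check`, and `start` are hypothetical names and are not part of Curator or ZooKeeper.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical network-only liveness probe; not part of Curator or ZooKeeper. */
class RuokProbe {
    private final InetSocketAddress server;
    private final int timeoutMillis;

    RuokProbe(InetSocketAddress server, int timeoutMillis) {
        this.server = server;
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns true only if the server replies "imok"; connect and each read are bounded by the timeout. */
    boolean check() {
        try (Socket socket = new Socket()) {
            socket.connect(server, timeoutMillis);   // fail fast if the host is unreachable
            socket.setSoTimeout(timeoutMillis);      // fail fast if the reply never arrives
            OutputStream out = socket.getOutputStream();
            out.write("ruok".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[4];
            int n = 0;
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) break;                    // server closed before a full reply
                n += r;
            }
            return n == 4 && "imok".equals(new String(buf, StandardCharsets.US_ASCII));
        } catch (IOException e) {
            return false;                            // refused, reset, or SocketTimeoutException
        }
    }

    /** Probes periodically on a background thread; calls onFailure after each failed probe. */
    ScheduledExecutorService start(long periodMillis, Runnable onFailure) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleWithFixedDelay(() -> {
            if (!check()) {
                onFailure.run();                     // hypothetical hook: e.g. enter SUSPENDED
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        return exec;
    }
}
```

Because the probe never goes through the server's request pipeline, a slow disk should not delay it the way it can delay session pings. As the top reply points out, though, it only changes this one client's view: the server still keeps the session alive until the full session timeout elapses, so other clients see no faster failover.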
