helix-user mailing list archives

From Ming Fang <mingf...@mac.com>
Subject Re: Failure detection time
Date Sun, 03 Mar 2013 15:06:24 GMT
Thanks Kishore.

For our system we're going to start small. 
It consists of 1 controller, 1 master, and 1 slave.
But the unplanned failover time must be under 1 second.

I tried setting zk.session.timeout to 1000 on the participants, but it doesn't seem to make
a difference. It still takes 30 seconds for the controller to detect a killed node.
Do I have to set this property everywhere, i.e. on ZooKeeper, the controller, and the participants?
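
One thing that bears on the 1000 ms setting: the ZooKeeper server does not accept whatever session timeout a client asks for. It clamps the request to a range that defaults to 2 x tickTime through 20 x tickTime (adjustable with minSessionTimeout and maxSessionTimeout in zoo.cfg). The sketch below reproduces only that arithmetic. The class and method names are made up for illustration, and tickTime=2000 is assumed because it is the value in ZooKeeper's sample config; the actual server here may differ.

```java
public class SessionTimeoutNegotiation {

    // Clamp a requested session timeout to the server's default bounds:
    // min = 2 * tickTime, max = 20 * tickTime (all values in milliseconds).
    static int negotiated(int requestedMs, int tickTimeMs) {
        int minMs = 2 * tickTimeMs;
        int maxMs = 20 * tickTimeMs;
        return Math.max(minMs, Math.min(maxMs, requestedMs));
    }

    public static void main(String[] args) {
        int tickTimeMs = 2000; // assumption: value from ZooKeeper's sample zoo.cfg
        for (int requestedMs : new int[] {1000, 15000, 30000, 60000}) {
            System.out.printf("requested=%d ms -> negotiated=%d ms%n",
                    requestedMs, negotiated(requestedMs, tickTimeMs));
        }
    }
}
```

Under these assumptions a 1000 ms request would be raised to 4000 ms, so a sub-second session timeout would also need a smaller tickTime or an explicit minSessionTimeout on the server. This does not by itself explain the 30 seconds observed, which matches the default and may mean the property is not reaching the process that owns the session.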

Sent from my iPad

On Mar 3, 2013, at 1:53 AM, kishore g <g.kishore@gmail.com> wrote:

> There are two kinds of failover: planned (during a software upgrade) and unplanned (node crash).
> For planned failover, you should add a JVM shutdown hook from which you invoke
> helixmanager.disconnect(), and then invoke kill <pid>. This will allow Helix to detect the
> failure almost immediately, in about 5-15 milliseconds.
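
A minimal sketch of the shutdown hook described above. To keep it runnable without Helix on the classpath, the disconnect action is passed in as a plain Runnable; with a real manager this would presumably be helixManager::disconnect. The class and method names are illustrative, not part of Helix.

```java
public class GracefulShutdown {

    // Build the hook thread separately so it can be exercised without exiting the JVM.
    static Thread buildHook(Runnable disconnect) {
        return new Thread(() -> {
            try {
                disconnect.run(); // e.g. helixManager.disconnect() in a real participant
            } catch (RuntimeException e) {
                // The JVM is going down anyway; just report the failure.
                System.err.println("disconnect failed during shutdown: " + e);
            }
        }, "helix-disconnect-hook");
    }

    // Register the hook with the JVM; it runs on normal exit and on SIGTERM (kill <pid>).
    static Thread install(Runnable disconnect) {
        Thread hook = buildHook(disconnect);
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        // Placeholder action standing in for the real disconnect call.
        install(() -> System.out.println("disconnecting from cluster"));
        System.out.println("hook installed; exiting main triggers it");
    }
}
```

Shutdown hooks run on a plain kill <pid> but not on kill -9, so a hard kill still falls under the unplanned case and is bounded by the session timeout.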
> For unplanned failover, detection time is determined by the ZooKeeper session timeout, which
> is set to 30 seconds by default. You can change this to be more aggressive, such as 5, 10, or
> 15 seconds. The recommended value is 15 seconds. You can change it by setting the system
> property "zk.session.timeout"= 15*1000.
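
A small sketch of setting that system property from code. The property name and the 30 second default come from the message above; the class name and the lookup helper are illustrative, and it is an assumption that the value is read when the Helix manager is created, so it should be set before then.

```java
public class ZkSessionTimeoutProperty {

    static final String KEY = "zk.session.timeout";
    static final int DEFAULT_MS = 30_000; // default quoted in this thread

    // Read the property the way a client typically would: an int with a fallback.
    static int effectiveTimeoutMs() {
        return Integer.getInteger(KEY, DEFAULT_MS);
    }

    public static void main(String[] args) {
        System.out.println("before: " + effectiveTimeoutMs() + " ms");
        // Equivalent to starting the JVM with -Dzk.session.timeout=15000
        System.setProperty(KEY, String.valueOf(15 * 1000));
        System.out.println("after:  " + effectiveTimeoutMs() + " ms");
    }
}
```

System properties are per JVM, so each participant and controller process needs the setting (or the -D flag) individually. The value is only what the client requests from ZooKeeper.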
> helixmanager.flappingTimeWindow and helixmanager.maxDisconnectThreshold can be tuned
> in case you have a bad network or excessive GC pauses. You probably don't need to tune
> these, but let me know if you need additional info on them.
> Failover time depends on the number of partitions, nodes, resources, etc. in the system. For a
> 1000-partition system with 10 nodes, failover time for one node might be 200-300 milliseconds.

> Jason has done a lot of performance improvements on another branch that might improve this
> time further.
> thanks,
> Kishore G
> On Sat, Mar 2, 2013 at 9:53 PM, Ming Fang <mingfang@mac.com> wrote:
>> How can I tune the amount of time it takes for detecting a failed node, e.g. kill
>> Is it by setting "helixmanager.flappingTimeWindow"?
>> What is the fastest possible time for a failover?
