helix-user mailing list archives

From kishore g <g.kish...@gmail.com>
Subject Re: Failure detection time
Date Sun, 03 Mar 2013 06:53:49 GMT
There are two kinds of failover: planned (e.g. during a software upgrade)
and unplanned (node crash, etc.).

For planned failover, add a JVM shutdown hook that invokes
helixmanager.disconnect(), and then invoke kill <pid>. A plain kill sends
SIGTERM, so the hook runs (unlike kill -9). This allows Helix to detect the
failure almost immediately, in about 5-15 milliseconds.
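
As a rough sketch of the planned case, the hook could be registered along
these lines. The GracefulShutdown class and the Disconnectable interface
are hypothetical stand-ins so the snippet compiles without Helix on the
classpath; they are not part of Helix.

```java
public class GracefulShutdown {

    /** Hypothetical stand-in for the single manager method the hook needs. */
    public interface Disconnectable {
        void disconnect();
    }

    /** Builds the hook thread. Kept separate so it can be run directly in a test. */
    public static Thread newDisconnectHook(Disconnectable manager) {
        return new Thread(manager::disconnect, "helix-disconnect-hook");
    }

    /** Registers the hook; the JVM runs it on normal exit or SIGTERM, not on kill -9. */
    public static void register(Disconnectable manager) {
        Runtime.getRuntime().addShutdownHook(newDisconnectHook(manager));
    }
}
```

Assuming manager is your connected HelixManager, something like
GracefulShutdown.register(manager::disconnect) at startup should be enough.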

For unplanned failover, detection time is determined by the ZooKeeper
session timeout, which defaults to 30 seconds. You can make this more
aggressive, e.g. 5, 10, or 15 seconds; the recommended value is 15 seconds.
Change it by setting the system property "zk.session.timeout" to the value
in milliseconds, e.g. 15*1000.
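
A minimal sketch of setting that property from code, assuming it is set
before the Helix manager is created and connects. The SessionTimeoutConfig
class and its method names are made up for illustration; only the property
name "zk.session.timeout" comes from the text above.

```java
public class SessionTimeoutConfig {

    /** Property name as given above. */
    public static final String KEY = "zk.session.timeout";

    /** Stores the timeout in milliseconds, e.g. 15 seconds becomes "15000". */
    public static void setSessionTimeoutSeconds(int seconds) {
        System.setProperty(KEY, Integer.toString(seconds * 1000));
    }

    /** Reads the property back, falling back to defaultMs if it is unset or not a number. */
    public static int currentTimeoutMs(int defaultMs) {
        return Integer.getInteger(KEY, defaultMs);
    }
}
```

The same effect can be had without code by passing
-Dzk.session.timeout=15000 on the java command line.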

helixmanager.flappingTimeWindow and helixmanager.maxDisconnectThreshold can
be tuned if you have a bad network or excessive GC pauses. You probably
don't need to tune these, but let me know if you need additional info on
them.

Once the failure is detected, failover time depends on the number of
partitions, nodes, resources, etc. in the system. For a 1000-partition
system with 10 nodes, failover time for one node might be 200-300
milliseconds.

Jason has made a lot of performance improvements on another branch that
might improve this time further.

Kishore G

On Sat, Mar 2, 2013 at 9:53 PM, Ming Fang <mingfang@mac.com> wrote:

> How can I tune the amount of time it takes for detecting a failed node,
> e.g. kill -9?
> Is it by setting "helixmanager.flappingTimeWindow"?
> What is the fastest possible time for a failover?
