helix-user mailing list archives

From Ming Fang <mingf...@mac.com>
Subject Re: Failure detection time
Date Sun, 03 Mar 2013 18:34:02 GMT
I've tried setting the zk.session.timeout property from my participants, but I don't think it's taking effect.
Looking at org.apache.helix.manager.zk.ZKHelixManager line 155, it seems the session timeout
is set to the same value as helixmanager.flappingTimeWindow.
That looks like a bug, since the two values serve different purposes.

As a temporary workaround, this hack works:

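            // Hack to set sessionTimeout via reflection (needs java.lang.reflect.Field).
            // ZK_ADDRESS is a placeholder: the final argument was cut off in the original message.
            manager = HelixManagerFactory.getZKHelixManager(CLUSTER_NAME, instanceName, InstanceType.PARTICIPANT,
                ZK_ADDRESS);
            Field sessionTimeout = ZKHelixManager.class.getDeclaredField("_sessionTimeout");
            sessionTimeout.setAccessible(true); // required if the field is not public
            sessionTimeout.setInt(manager, 1000); // session timeout in milliseconds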

Also, on the ZooKeeper side I set tickTime = 500 and minSessionTimeout = 1000.
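
For context on why the server-side settings matter: a ZooKeeper server clamps the session timeout a client requests into the range [minSessionTimeout, maxSessionTimeout], which default to 2 x tickTime and 20 x tickTime when not set. Below is a rough sketch of that clamping. The class and method names are made up for illustration, and this is not ZooKeeper's actual code.

```java
public class SessionTimeoutSketch {
    // Returns the session timeout (ms) the server would grant for a requested value.
    // Passing -1 for min or max means "not configured, use the tickTime-based default".
    static int negotiate(int requestedMs, int tickTimeMs, int minSessionTimeoutMs, int maxSessionTimeoutMs) {
        int min = minSessionTimeoutMs == -1 ? 2 * tickTimeMs : minSessionTimeoutMs;
        int max = maxSessionTimeoutMs == -1 ? 20 * tickTimeMs : maxSessionTimeoutMs;
        // Clamp the requested timeout into [min, max].
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        // With tickTime = 2000 and no explicit bounds, a 1000 ms request is raised to 4000 ms.
        System.out.println(negotiate(1000, 2000, -1, -1));  // 4000
        // With tickTime = 500 and minSessionTimeout = 1000, the 1000 ms request is honored.
        System.out.println(negotiate(1000, 500, 1000, -1)); // 1000
    }
}
```

So lowering the timeout on the client alone is not enough if the server's minimum is higher than the value requested.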

On Mar 3, 2013, at 1:53 AM, kishore g <g.kishore@gmail.com> wrote:

> There are two kinds of failover: planned (during a software upgrade) and unplanned (node crash).
> For planned, you should add a JVM shutdown hook from which you invoke helixmanager.disconnect(),
> and then invoke kill <pid>. This allows Helix to detect the failure almost immediately,
> within about 5-15 milliseconds.
> For unplanned, it is determined by the ZooKeeper session timeout, which is set to 30 seconds
> by default. You can make this more aggressive, such as 5, 10, or 15 seconds. The recommended
> value is 15 seconds. You can change it by setting the system property "zk.session.timeout" = 15*1000.
> helixmanager.flappingTimeWindow and helixmanager.maxDisconnectThreshold can be tuned
> in case you have bad network conditions and excessive GCs. You probably don't need to tune
> these, but let me know if you need additional info on them.
> Failover time depends on the number of partitions, nodes, resources, etc. in the system. For a
> 1000-partition system with 10 nodes, failover time for one node might be 200-300 milliseconds.

> Jason has done a lot of performance improvements on another branch that might improve this
> time further.
> thanks,
> Kishore G
> On Sat, Mar 2, 2013 at 9:53 PM, Ming Fang <mingfang@mac.com> wrote:
> > How can I tune the amount of time it takes for detecting a failed node, e.g. kill -9?
> > Is it by setting "helixmanager.flappingTimeWindow"?
> > What is the fastest possible time for a failover?
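
The planned-failover advice quoted above (a JVM shutdown hook that calls disconnect()) might look roughly like the sketch below. To keep it self-contained, Disconnectable is a hypothetical stand-in for the single HelixManager method involved. In a real participant you would pass a lambda that calls disconnect() on your connected manager.

```java
public class ShutdownHookSketch {
    // Hypothetical stand-in for the one HelixManager method used here.
    interface Disconnectable {
        void disconnect();
    }

    // Builds the hook thread. Kept separate from registration so it can be exercised directly.
    static Thread newDisconnectHook(Disconnectable manager) {
        return new Thread(manager::disconnect, "helix-disconnect-hook");
    }

    public static void main(String[] args) {
        // Placeholder manager that just logs. Swap in a real, connected manager.
        Disconnectable manager = () -> System.out.println("disconnected");
        Runtime.getRuntime().addShutdownHook(newDisconnectHook(manager));
        // On normal exit or a plain `kill <pid>` (SIGTERM) the JVM runs the hook.
        // `kill -9` (SIGKILL) skips it, which is the unplanned case governed by the session timeout.
    }
}
```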
