helix-user mailing list archives

From Ming Fang <mingf...@mac.com>
Subject Re: Failure detection time
Date Mon, 04 Mar 2013 05:16:51 GMT
It's just a one-line fix:
https://github.com/mingfang/apache-helix/commit/c7a7a840c9347cb362080619c53db23345b5ed10

I'm afraid writing a proper test to detect session timeout is beyond me at this point.

On Mar 3, 2013, at 11:59 PM, kishore g <g.kishore@gmail.com> wrote:

> Thanks Ming, good catch. Do you mind submitting a patch and adding a test case?
> 
> https://issues.apache.org/jira/browse/HELIX-55
> 
> Thanks,
> Kishore G
> 
> On Sun, Mar 3, 2013 at 10:34 AM, Ming Fang <mingfang@mac.com> wrote:
> I've tried setting the zk.session.timeout property from my participants, but I don't think it's working.
> Looking at org.apache.helix.manager.zk.ZKHelixManager line 155, it seems the session timeout is set to the same value as helixmanager.flappingTimeWindow.
> That looks like a bug, since these two values serve different purposes.
> 
> As a temporary workaround, this hack works:
> 
>             manager = HelixManagerFactory.getZKHelixManager(CLUSTER_NAME, instanceName,
>                 InstanceType.PARTICIPANT, ZK_ADDRESS);
>             {
>                 // Hack: overwrite the private _sessionTimeout field via reflection
>                 // (requires import java.lang.reflect.Field)
>                 Field sessionTimeout = ZKHelixManager.class.getDeclaredField("_sessionTimeout");
>                 sessionTimeout.setAccessible(true);
>                 sessionTimeout.setInt(manager, 1000); // milliseconds
>             }
> 
> Also, on the ZooKeeper side I set tickTime = 500 and minSessionTimeout = 1000.
> 
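The server-side settings matter because a ZooKeeper server does not simply accept the session timeout a client requests: it bounds the value between minSessionTimeout and maxSessionTimeout, which default to 2x and 20x tickTime when not configured. The following is only a minimal sketch of that bounding rule, not ZooKeeper's actual code; the class name, method name, and the use of -1 for "not configured" are assumptions made for illustration.

```java
// Illustrative sketch: how a requested session timeout is bounded by the server.
// SessionTimeoutSketch and negotiate(...) are hypothetical names, not ZooKeeper APIs.
final class SessionTimeoutSketch {

    /**
     * Returns the timeout the server would grant, in milliseconds.
     * Pass -1 for min/max to use the defaults derived from tickTime.
     */
    static int negotiate(int requestedMs, int tickTimeMs,
                         int minSessionTimeoutMs, int maxSessionTimeoutMs) {
        int min = minSessionTimeoutMs == -1 ? 2 * tickTimeMs : minSessionTimeoutMs;
        int max = maxSessionTimeoutMs == -1 ? 20 * tickTimeMs : maxSessionTimeoutMs;
        // Raise to the minimum first, then cap at the maximum.
        return Math.min(max, Math.max(min, requestedMs));
    }

    public static void main(String[] args) {
        // With tickTime = 2000 and no explicit bounds, a 1000 ms request is raised to 4000 ms.
        System.out.println(negotiate(1000, 2000, -1, -1));
        // With tickTime = 500 and minSessionTimeout = 1000, a 1000 ms request is granted as-is.
        System.out.println(negotiate(1000, 500, 1000, -1));
    }
}
```

Under these assumptions, a 1000 ms client-side timeout only takes effect once the server's lower bound is at or below 1000 ms, which is what the tickTime and minSessionTimeout values above achieve.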
> On Mar 3, 2013, at 1:53 AM, kishore g <g.kishore@gmail.com> wrote:
> 
>> There are two kinds of failover: planned (e.g. during a software upgrade) and unplanned (node crash, etc.).
>> 
>> For planned failover, you should add a JVM shutdown hook that invokes helixmanager.disconnect(), and then stop the process with kill <pid>. This allows Helix to detect the failure almost immediately, within about 5-15 milliseconds.
>> 
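A minimal sketch of the shutdown-hook approach described above. To keep it runnable without Helix on the classpath, the manager is represented by a hypothetical Disconnectable interface; in a real participant the hook would call disconnect() on the HelixManager instance instead. The class, interface, and thread names are assumptions for illustration only.

```java
// Illustrative sketch: disconnect cleanly when the JVM is asked to exit.
final class GracefulShutdownSketch {

    /** Hypothetical stand-in for the Helix manager; only disconnect() is needed here. */
    interface Disconnectable {
        void disconnect();
    }

    /** Builds the hook thread without registering it, so callers can register or test it. */
    static Thread newShutdownHook(Disconnectable manager) {
        return new Thread(manager::disconnect, "helix-disconnect-hook");
    }

    public static void main(String[] args) {
        Disconnectable manager = () -> System.out.println("disconnected");
        // The JVM runs registered hooks on normal exit and on SIGTERM (plain `kill <pid>`).
        Runtime.getRuntime().addShutdownHook(newShutdownHook(manager));
        // ... participant work would happen here ...
    }
}
```

Shutdown hooks do not run when the process is killed with kill -9, so that case still falls under unplanned failover.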
>> For unplanned failover, detection time is determined by the ZooKeeper session timeout, which defaults to 30 seconds. You can make this more aggressive, e.g. 5, 10, or 15 seconds; the recommended value is 15 seconds. You can change it by setting the system property "zk.session.timeout" to 15*1000 (milliseconds).
>> 
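As a rough sketch of what setting this property looks like in practice: it can be passed on the command line as -Dzk.session.timeout=15000 or set programmatically before the manager is created. How Helix reads the property internally is not shown in this thread, so the lookup below (and the class and constant names) is an assumption that only mirrors the documented 30-second default. Note also that the replies in this thread report the property being overridden in ZKHelixManager (HELIX-55).

```java
// Illustrative sketch: supplying and reading the session timeout as a system property.
final class SessionTimeoutProperty {
    static final String KEY = "zk.session.timeout";
    static final int DEFAULT_MS = 30_000; // 30-second default mentioned above

    /** Returns the configured timeout in milliseconds, or the default if unset. */
    static int sessionTimeoutMs() {
        return Integer.getInteger(KEY, DEFAULT_MS);
    }

    public static void main(String[] args) {
        // Equivalent to starting the JVM with -Dzk.session.timeout=15000
        System.setProperty(KEY, String.valueOf(15 * 1000));
        System.out.println(sessionTimeoutMs());
    }
}
```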
>> helixmanager.flappingTimeWindow and helixmanager.maxDisconnectThreshold can be tuned if you have an unreliable network or excessive GC pauses. You probably don't need to tune these, but let me know if you need additional info.
>> 
>> Failover time depends on the number of partitions, nodes, resources, etc. in the system. For a 1000-partition system with 10 nodes, failover time for one node might be 200-300 milliseconds.
>> 
>> Jason has made a lot of performance improvements on another branch that might reduce this time further.
>> 
>> thanks,
>> Kishore G
>> 
>> On Sat, Mar 2, 2013 at 9:53 PM, Ming Fang <mingfang@mac.com> wrote:
>> How can I tune the amount of time it takes to detect a failed node, e.g. after kill -9?
>> Is it by setting "helixmanager.flappingTimeWindow"?
>> 
>> What is the fastest possible time for a failover?
>> 
> 
> 

