helix-user mailing list archives

From kishore g <g.kish...@gmail.com>
Subject Re: Failure detection time
Date Mon, 04 Mar 2013 05:49:43 GMT
Thanks. Pushed the fix.


On Sun, Mar 3, 2013 at 9:16 PM, Ming Fang <mingfang@mac.com> wrote:

> It's just a one-line fix:
>
> https://github.com/mingfang/apache-helix/commit/c7a7a840c9347cb362080619c53db23345b5ed10
>
> I'm afraid writing a proper test to detect session timeout is beyond me at
> this point.
>
> On Mar 3, 2013, at 11:59 PM, kishore g <g.kishore@gmail.com> wrote:
>
> Thanks Ming, good catch. Would you mind submitting a patch and adding a
> test case?
>
> https://issues.apache.org/jira/browse/HELIX-55
>
> Thanks,
> Kishore G
>
> On Sun, Mar 3, 2013 at 10:34 AM, Ming Fang <mingfang@mac.com> wrote:
>
>> I've tried setting the zk.session.timeout property from my participants,
>> but I don't think it's working.
>> Looking at org.apache.helix.manager.zk.ZKHelixManager line 155, it seems
>> the session timeout is set to the same value as
>> helixmanager.flappingTimeWindow.
>> That looks like a bug, since these two values serve different purposes.
>>
>> As a temporary workaround, this hack works:
>>
>>             manager = HelixManagerFactory.getZKHelixManager(
>>                 CLUSTER_NAME, instanceName, InstanceType.PARTICIPANT, ZK_ADDRESS);
>>             {
>>                 // Hack: set the private _sessionTimeout field via reflection
>>                 // (requires java.lang.reflect.Field).
>>                 Field sessionTimeout =
>>                     ZKHelixManager.class.getDeclaredField("_sessionTimeout");
>>                 sessionTimeout.setAccessible(true);
>>                 sessionTimeout.setInt(manager, 1000); // milliseconds
>>             }
>>
>> Also, on the ZooKeeper side I set tickTime = 500 and minSessionTimeout
>> = 1000.
>>
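The server-side settings above matter because ZooKeeper clamps the session timeout a client requests to a server-defined range: minSessionTimeout defaults to 2 x tickTime and maxSessionTimeout to 20 x tickTime. The following is a minimal sketch of that arithmetic only; the class and method names are hypothetical and are not part of ZooKeeper or Helix.

```java
public class SessionTimeoutBounds {
    /**
     * Clamps a client-requested session timeout to the server's allowed range.
     * All values are in milliseconds. A null bound means "not configured", in
     * which case the documented defaults (2x and 20x tickTime) are used.
     * Hypothetical helper for illustration; not a ZooKeeper API.
     */
    static int negotiate(int requestedMs, int tickTimeMs,
                         Integer minSessionTimeoutMs, Integer maxSessionTimeoutMs) {
        int min = (minSessionTimeoutMs != null) ? minSessionTimeoutMs : 2 * tickTimeMs;
        int max = (maxSessionTimeoutMs != null) ? maxSessionTimeoutMs : 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        // tickTime = 2000 and no explicit bounds: a 1000 ms request is raised to 4000 ms.
        System.out.println(negotiate(1000, 2000, null, null));
        // tickTime = 500, minSessionTimeout = 1000: the 1000 ms request is honored.
        System.out.println(negotiate(1000, 500, 1000, null));
    }
}
```

So a 1-second client timeout only takes effect if the server's lower bound is at or below 1000 ms, which is what the tickTime and minSessionTimeout changes achieve.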
>> On Mar 3, 2013, at 1:53 AM, kishore g <g.kishore@gmail.com> wrote:
>>
>> There are two kinds of failover: planned (e.g. during a software upgrade)
>> and unplanned (node crash, etc.).
>>
>> For planned failover, add a JVM shutdown hook that invokes
>> helixmanager.disconnect(), and then run kill <pid>. This allows Helix to
>> detect the failure almost immediately, in about 5-15 milliseconds.
>>
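A minimal sketch of the planned-shutdown path described above, assuming the participant holds a reference to its manager. To keep the example self-contained, the Disconnectable interface is a hypothetical stand-in for the single disconnect() method used here; it is not a Helix type.

```java
public class GracefulShutdown {
    /** Hypothetical stand-in for the one manager method this sketch needs. */
    interface Disconnectable {
        void disconnect();
    }

    /** Builds (but does not register) the hook thread, so it can be exercised directly. */
    static Thread newDisconnectHook(Disconnectable manager) {
        return new Thread(manager::disconnect, "helix-disconnect-hook");
    }

    public static void main(String[] args) {
        // Placeholder manager for the sketch; a real participant would pass its own.
        Disconnectable manager = () -> System.out.println("disconnected");

        // The JVM runs registered hooks on normal exit and on SIGTERM (plain `kill <pid>`).
        Runtime.getRuntime().addShutdownHook(newDisconnectHook(manager));

        // ... participant work would happen here ...
    }
}
```

With a real manager, `newDisconnectHook(manager::disconnect)` should adapt it to the stand-in interface. Shutdown hooks do not run on kill -9, so that case still falls back to the session timeout.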
>> For unplanned failover, detection time is determined by the ZooKeeper
>> session timeout, which defaults to 30 seconds. You can make this more
>> aggressive, e.g. 5, 10, or 15 seconds; the recommended value is 15 seconds.
>> Change it by setting the system property "zk.session.timeout" to 15*1000.
>>
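A minimal sketch of setting that property programmatically, equivalent to passing -Dzk.session.timeout=15000 on the java command line. The read-back helper and its 30-second fallback are assumptions for illustration; how Helix itself parses the property is not shown in this thread.

```java
public class SessionTimeoutProperty {
    static final String KEY = "zk.session.timeout";
    static final int DEFAULT_MS = 30_000; // the 30-second default mentioned above

    /** Hypothetical read-back helper: returns the configured timeout in milliseconds. */
    static int configuredTimeoutMs() {
        return Integer.getInteger(KEY, DEFAULT_MS);
    }

    public static void main(String[] args) {
        // Presumably this must be set before the manager is created, so it is seen at startup.
        System.setProperty(KEY, String.valueOf(15 * 1000));
        System.out.println(configuredTimeoutMs()); // prints 15000
    }
}
```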
>> helixmanager.flappingTimeWindow and helixmanager.maxDisconnectThreshold
>> can be tuned if you have a bad network or excessive GC pauses. You
>> probably don't need to tune these, but let me know if you need additional
>> info on them.
>>
>> Failover time depends on the number of partitions, nodes, resources, etc. in
>> the system. For a 1000-partition system with 10 nodes, failover time for one
>> node might be 200-300 milliseconds.
>>
>> Jason has made a lot of performance improvements on another branch that
>> might reduce this time further.
>>
>> thanks,
>> Kishore G
>>
>> On Sat, Mar 2, 2013 at 9:53 PM, Ming Fang <mingfang@mac.com> wrote:
>>
>>> How can I tune the amount of time it takes for detecting a failed node,
>>> e.g. kill -9?
>>> Is it by setting "helixmanager.flappingTimeWindow"?
>>>
>>> What is the fastest possible time for a failover?
>>
>>
>>
>>
>
>
