Mailing-List: contact user-help@helix.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@helix.incubator.apache.org
Received-SPF: pass (nike.apache.org: domain of mingfang@mac.com designates
 17.172.204.239 as permitted sender)
From: Ming Fang <mingfang@mac.com>
Content-type: multipart/alternative;
 boundary="Apple-Mail=_18C32BC7-0F6C-4BF7-876D-07EE15D745BE"
Message-id: <893A02E0-A7A2-4061-A6B7-388BBC671607@mac.com>
MIME-version: 1.0 (Mac OS X Mail 6.2 \(1499\))
Subject: Re: Failure detection time
Date: Sun, 03 Mar 2013 13:34:02 -0500
References: <E0A19CB8-9536-4C22-8A9B-873F428073BA@mac.com>
 <CABaj-QZYUAUOb23NsAH0LF=wmU+BHC+Ejq9MBsmuucrgYEdqzw@mail.gmail.com>
To: user@helix.incubator.apache.org
In-reply-to: 
 <CABaj-QZYUAUOb23NsAH0LF=wmU+BHC+Ejq9MBsmuucrgYEdqzw@mail.gmail.com>


--Apple-Mail=_18C32BC7-0F6C-4BF7-876D-07EE15D745BE
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=iso-8859-1

I've tried setting the zk.session.timeout property from my participants, but
I don't think it's working.
Looking at org.apache.helix.manager.zk.ZKHelixManager line 155, it seems
the session timeout is set to the same value as
helixmanager.flappingTimeWindow.
That looks like a bug, since these two values serve different purposes.
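
For reference, a minimal sketch of setting the property programmatically, equivalent to passing -Dzk.session.timeout=15000 on the command line. The class and method names are made up for illustration, and the lookup via Integer.getInteger is only an assumption about how a client library might read the value, not Helix's actual code.

```java
public class SessionTimeoutProperty {
    static final String KEY = "zk.session.timeout";
    // 30 seconds is the default mentioned in the quoted reply.
    static final int DEFAULT_MS = 30 * 1000;

    // Assumed lookup: read the system property, falling back to the default.
    static int effectiveTimeoutMs() {
        return Integer.getInteger(KEY, DEFAULT_MS);
    }

    public static void main(String[] args) {
        // Must be set before the manager is created for it to have any chance of being read.
        System.setProperty(KEY, String.valueOf(15 * 1000));
        System.out.println(effectiveTimeoutMs());
    }
}
```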

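As a temporary workaround, this hack works:

            // Needs: import java.lang.reflect.Field;
            manager = HelixManagerFactory.getZKHelixManager(CLUSTER_NAME, instanceName,
                    InstanceType.PARTICIPANT, ZK_ADDRESS);
            {
                // Hack: override the private _sessionTimeout field via reflection.
                // Presumably this has to happen before manager.connect().
                // getDeclaredField/setInt throw checked exceptions the caller must handle.
                Field sessionTimeout = ZKHelixManager.class.getDeclaredField("_sessionTimeout");
                sessionTimeout.setAccessible(true);
                sessionTimeout.setInt(manager, 1000); // milliseconds
            }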

Also, on the ZooKeeper side I set tickTime = 500 and minSessionTimeout = 1000.
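
The server-side change matters because the ZooKeeper server clamps whatever timeout the client requests into its own allowed range, which defaults to 2 x tickTime (minimum) and 20 x tickTime (maximum). The sketch below only illustrates that arithmetic; the class and method names are invented for the example and are not ZooKeeper APIs.

```java
public class SessionTimeoutSketch {
    // Clamp the timeout a client asks for into the server's allowed range.
    static int negotiate(int requestedMs, int minMs, int maxMs) {
        return Math.max(minMs, Math.min(requestedMs, maxMs));
    }

    // ZooKeeper's defaults when minSessionTimeout/maxSessionTimeout are not set.
    static int defaultMin(int tickTimeMs) { return 2 * tickTimeMs; }
    static int defaultMax(int tickTimeMs) { return 20 * tickTimeMs; }

    public static void main(String[] args) {
        // tickTime = 2000: a 1000 ms request is raised to the 4000 ms minimum.
        System.out.println(negotiate(1000, defaultMin(2000), defaultMax(2000)));
        // tickTime = 500, minSessionTimeout = 1000: the 1000 ms request is accepted.
        System.out.println(negotiate(1000, 1000, defaultMax(500)));
    }
}
```

So with a tickTime of 2000, a 1000 ms request would not go below 4000 ms no matter what the client sets.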

On Mar 3, 2013, at 1:53 AM, kishore g <g.kishore@gmail.com> wrote:

> There are two kinds of failover: planned (e.g. during a software upgrade)
> and unplanned (node crash, etc.).
>
> For planned, you should add a JVM shutdown hook from which you invoke
> helixmanager.disconnect(), and then invoke kill <pid>. This will allow
> Helix to detect the failure almost immediately, in about 5-15 milliseconds.
>
> For unplanned, it is determined by the ZooKeeper session timeout, which is
> set to 30 seconds by default. You can make this more aggressive, like 5, 10
> or 15 seconds. The recommended value is 15 seconds. You can change this by
> setting the system property "zk.session.timeout" = 15*1000.
>
> helixmanager.flappingTimeWindow and helixmanager.maxDisconnectThreshold can
> be tuned in case you have bad network conditions or excessive GCs. You
> probably don't need to tune these, but let me know if you need additional
> info on this.
>
> Failover time depends on the number of partitions, nodes, resources, etc. in
> the system. For a 1000-partition system with 10 nodes, failover time for
> one node might be 200-300 milliseconds.
>
> Jason has done a lot of performance improvements on another branch that
> might improve this time further.
>
> thanks,
> Kishore G
>
> On Sat, Mar 2, 2013 at 9:53 PM, Ming Fang <mingfang@mac.com> wrote:
> How can I tune the amount of time it takes to detect a failed node,
> e.g. kill -9?
> Is it by setting "helixmanager.flappingTimeWindow"?
>
> What is the fastest possible time for a failover?
>
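
The shutdown hook suggested in the quoted reply for planned failover could look roughly like the sketch below. Disconnectable is a hypothetical stand-in for the single HelixManager method used (disconnect()), so the example compiles without Helix on the classpath; the class, method, and thread names are likewise made up.

```java
public class GracefulShutdown {
    // Hypothetical stand-in for the one HelixManager method used here.
    interface Disconnectable {
        void disconnect();
    }

    // Register a hook that disconnects on normal JVM exit or on a plain kill (SIGTERM).
    static Thread registerDisconnectHook(Disconnectable manager) {
        Thread hook = new Thread(manager::disconnect, "helix-disconnect-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        // With Helix this would be: registerDisconnectHook(manager::disconnect);
        registerDisconnectHook(() -> System.out.println("disconnected"));
    }
}
```

Shutdown hooks do not run on kill -9, so that case still falls back to the ZooKeeper session timeout.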


--Apple-Mail=_18C32BC7-0F6C-4BF7-876D-07EE15D745BE--