From: Todd Lipcon <todd@cloudera.com>
Date: Mon, 15 Oct 2012 18:24:35 -0700
Subject: Re: Question about namenode HA
To: 谢良 (Xie Liang) <xieliang@xiaomi.com>
Cc: user@hadoop.apache.org

Hi Liang,

Answers inline below.

On Sun, Oct 14, 2012 at 8:01 PM, 谢良 <xieliang@xiaomi.com> wrote:

> Hi Todd and other HA experts,
>
> I have two questions:
>
> 1) Why is the ZKFC a separate process? I mean, what was the primary design
> consideration behind not integrating the ZKFC features into the NameNode
> itself?

There are a few reasons for this design choice:

1) Like Steve said, it's easier to monitor a process from another process than to self-monitor. Consider, for example, what happens if the NN somehow gets into a deadlock. The process may still be alive, and a ZooKeeper thread would keep running, even though the NN is not successfully handling any operations. The ZKFC, running in a separate process, periodically pings the local NN via RPC to ensure that the RPC server is still working properly, not deadlocked, etc.

2) If the NN process itself crashes (e.g. a segfault due to bad RAM), the ZKFC will notice quite quickly and delete its own ZooKeeper node. If the NN were holding its own ZK session, you would have to wait for the full ZK session timeout to expire. So the external ZKFC results in a faster failover for certain classes of failure.
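Not from the original mail, but the first point can be illustrated with a small, hypothetical Python sketch: a toy service whose handler thread gets wedged while its in-process heartbeat thread keeps ticking. Only an external RPC-style round trip through the handler (the ZKFC approach) detects the hang; the in-process "session" still looks alive.

```python
import queue
import threading
import time

class Service:
    """Toy stand-in for the NN: one handler thread serving a request queue."""

    def __init__(self):
        self.requests = queue.Queue()
        self.heartbeats = 0
        threading.Thread(target=self._handler, daemon=True).start()
        threading.Thread(target=self._heartbeat, daemon=True).start()

    def _handler(self):
        while True:
            req, reply = self.requests.get()
            if req == "hang":
                threading.Event().wait()  # simulate a deadlocked handler: blocks forever
            reply.put("ok")

    def _heartbeat(self):
        # An in-process ZK "session" would keep ticking like this even
        # while the handler above is wedged.
        while True:
            self.heartbeats += 1
            time.sleep(0.05)

    def rpc_ping(self, timeout):
        """External health check: a real round trip through the handler."""
        reply = queue.Queue()
        self.requests.put(("ping", reply))
        try:
            reply.get(timeout=timeout)
            return True
        except queue.Empty:
            return False

svc = Service()
assert svc.rpc_ping(timeout=1.0)            # healthy: the handler answers
svc.requests.put(("hang", queue.Queue()))   # wedge the handler
time.sleep(0.2)
before = svc.heartbeats
ping_ok = svc.rpc_ping(timeout=0.5)
print(ping_ok)                  # False: the external ping sees the hang
print(svc.heartbeats > before)  # True: self-monitoring still looks "alive"
```

This is only an analogy for the design argument, not how the real ZKFC is implemented: the point is that liveness of a background thread proves nothing about whether the service is actually handling requests.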
> 2) If I deploy CDH4.1 (which includes the QJM feature), since QJM can fence
> writers, can I safely just configure it like this?
>
>     <name>dfs.ha.fencing.methods</name>
>     <value>shell(/bin/true)</value>

Yes, this is safe. The design of the QuorumJournalManager ensures that multiple conflicting writers cannot corrupt your namespace in any way. You might still consider using sshfence ahead of that, with a short configured timeout -- this provides "read fencing". Otherwise the old NN could theoretically serve stale reads for a few seconds before it noticed that it had lost its ZK lease. But it's definitely not critical -- the old NN will eventually attempt some kind of write and abort itself. So I'd recommend /bin/true as the last configured method in your fencing list with QJM.

Thanks
-Todd

--
Todd Lipcon
Software Engineer, Cloudera
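Not part of the original thread, but as a concrete sketch of Todd's recommendation, a hypothetical hdfs-site.xml fragment: sshfence tried first with a short connect timeout for read fencing, and shell(/bin/true) last so failover still succeeds when the old NN's host is unreachable. The private-key path and the 5000 ms timeout are illustrative assumptions, not values from the mail.

```xml
<!-- Hypothetical hdfs-site.xml fragment; key path and timeout are illustrative. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- Methods are tried in order, one per line: sshfence for fast "read
       fencing", then shell(/bin/true) so failover succeeds even if the old
       NN's host is down (QJM already prevents conflicting writers). -->
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
<property>
  <!-- Short SSH connect timeout (ms) so a dead host doesn't stall failover. -->
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>5000</value>
</property>
```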
