From: Harsh J <harsh@cloudera.com>
To: user@hadoop.apache.org
Date: Wed, 16 Jan 2013 09:44:10 +0530
Subject: Re: hadoop namenode recovery

The NFS mount is to be soft-mounted, so if the NFS server goes down, the NN ejects that directory and continues with the local disk. If auto-restore is configured, it will re-add the NFS directory once it is detected to be healthy again.
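For illustration, a rough sketch of what that looks like on a 1.x NameNode
(the paths and mount options below are only placeholders; on 2.x the same
keys carry the dfs.namenode.* prefix instead):

  <!-- hdfs-site.xml: one local volume plus one NFS-backed volume -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data/1/dfs/nn,/mnt/nn-nfs/dfs/nn</value>
  </property>
  <property>
    <!-- try to bring a previously failed storage directory back online -->
    <name>dfs.name.dir.restore</name>
    <value>true</value>
  </property>

The NFS export behind /mnt/nn-nfs should be soft-mounted (e.g. options like
"soft,timeo=30,retrans=3" rather than the default hard mount), so a hung NFS
server causes that directory to fail instead of hanging the NameNode.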


On Wed, Jan 16, 2013 at 7:04 AM, randy <randysch@comcast.net> wrote:
What happens to the NN and/or performance if there's a problem with the NFS server? Or the network?

Thanks,
randy


On 01/14/2013 11:36 PM, Harsh J wrote:
It's very rare to observe an NN crash due to a software bug in
production. Most of the time it's a hardware fault you should worry about.
On 1.x, or any non-HA-carrying release, the best you can get to
safeguard against a total loss is to have redundant disk volumes
configured, one preferably over a dedicated remote NFS mount. This way
the NN is recoverable after the node goes down, since you can retrieve a current copy from another machine (i.e. via the NFS mount), set up a new node to replace the old NN, and continue along.
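As a rough sketch of that recovery (the paths and the start command below
are placeholders and assume a 1.x tarball layout), on the replacement node:

  # copy the current image/edits from the NFS copy into the new
  # node's local dfs.name.dir
  mkdir -p /data/1/dfs/nn
  cp -a /mnt/nn-nfs/dfs/nn/. /data/1/dfs/nn/

  # repoint clients and DataNodes at the new host (fs.default.name or DNS),
  # then bring the NameNode up
  bin/hadoop-daemon.sh start namenode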

A load balancer will not work, as the NN is not a simple web server: it
maintains state which you cannot simply sync. We wrote the HA-HDFS features to
address the very concern you have.

If you want true, painless HA, branch-2 is your best bet at this point.
An upcoming 2.0.3 release should include the QJM-based HA feature, which
is painless to set up, very reliable to use (compared to the other options), and
works with commodity-level hardware. FWIW, we (my team and I) have been
supporting several users and customers who are running the 2.x-based HA
in production and other types of environments, and it has been very
stable in our experience. There are also some folks in the community
running 2.x-based HDFS, for HA and other purposes.
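To give a flavour of the QJM-based setup (heavily abridged; the nameservice
name, hosts and ports below are placeholders, and the full list of required
properties is in the HDFS HA documentation):

  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <!-- the edit log is written to a quorum of JournalNodes -->
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
  </property>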


On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper <ouchwhisper@gmail.com> wrote:

Hello,

Is there a standard way to guard against a NameNode crash in
a Hadoop cluster? Or, what is the standard or best practice for
overcoming the single point of failure problem in Hadoop?

I am not ready to take chances on a production server with the Hadoop
2.0 alpha release, which claims to have solved the problem. Are there
any other things I can do to either prevent the failure, or recover
from it in a very short time?

Thanking You,

--
Regards,
Ouch Whisper
010101010101




--
Harsh J




--
Harsh J