From: Harsh J <harsh@cloudera.com>
To: user@hadoop.apache.org
Date: Wed, 16 Jan 2013 09:44:10 +0530
Subject: Re: hadoop namenode recovery

The NFS mount is to be soft-mounted, so if the NFS server goes down, the NN ejects that directory and continues with the local disk. If auto-restore is configured, it will re-add the NFS directory once it is detected to be healthy again.
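For illustration, a rough sketch of what that looks like on a 1.x NameNode
(the paths and mount options below are only placeholders; on 2.x the same
keys carry the dfs.namenode.* prefix instead):

  <!-- hdfs-site.xml: one local volume plus one NFS-backed volume -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data/1/dfs/nn,/mnt/nn-nfs/dfs/nn</value>
  </property>
  <property>
    <!-- try to bring a previously failed storage directory back online -->
    <name>dfs.name.dir.restore</name>
    <value>true</value>
  </property>

The NFS export behind /mnt/nn-nfs should be soft-mounted (e.g. options like
"soft,timeo=30,retrans=3" rather than the default hard mount), so a hung NFS
server causes that directory to fail instead of hanging the NameNode.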


On Wed, Jan 16, 2013 at 7:04 AM, randy <randysch@comcast.net> wrote:
What happens to the NN and/or performance if there's a problem with the NFS server? Or the network?

Thanks,
randy


On 01/14/2013 11:36 PM, Harsh J wrote:
It's very rare to observe an NN crash due to a software bug in
production. Most of the time it's a hardware fault you should worry about.
On 1.x, or any non-HA-carrying release, the best you can get to
safeguard against a total loss is to have redundant disk volumes
configured, one preferably over a dedicated remote NFS mount. This way
the NN is recoverable after the node goes down, since you can retrieve a current copy from another machine (i.e. via the NFS mount), set up a new node to replace the old NN, and continue along.
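As a rough sketch of that recovery (the paths and the start command below
are placeholders and assume a 1.x tarball layout), on the replacement node:

  # copy the current image/edits from the NFS copy into the new
  # node's local dfs.name.dir
  mkdir -p /data/1/dfs/nn
  cp -a /mnt/nn-nfs/dfs/nn/. /data/1/dfs/nn/

  # repoint clients and DataNodes at the new host (fs.default.name or DNS),
  # then bring the NameNode up
  bin/hadoop-daemon.sh start namenode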

A load balancer will not work, as the NN is not a simple web server: it
maintains state which you cannot simply sync. We wrote the HA-HDFS features to
address the very concern you have.

If you want true, painless HA, branch-2 is your best bet at this point.
An upcoming 2.0.3 release should include the QJM-based HA feature, which
is painless to set up, very reliable to use (compared to the other options), and
works with commodity-level hardware. FWIW, we (my team and I) have been
supporting several users and customers who are running the 2.x-based HA
in production and other types of environments, and it has been very
stable in our experience. There are also some folks in the community
running 2.x-based HDFS, for HA and other purposes.
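To give a flavour of the QJM-based setup (heavily abridged; the nameservice
name, hosts and ports below are placeholders, and the full list of required
properties is in the HDFS HA documentation):

  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <!-- the edit log is written to a quorum of JournalNodes -->
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
  </property>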


On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper <ouchwhisper@gmail.com> wrote:

Hello,

Is there a standard way to guard against a NameNode crash in
a Hadoop cluster? Or, what is the standard or best practice for
overcoming the single point of failure problem in Hadoop?

I am not ready to take chances on a production server with the Hadoop
2.0 alpha release, which claims to have solved the problem. Are there
any other things I can do to either prevent the failure, or recover
from it in a very short time?

Thanking You,

--
Regards,
Ouch Whisper
010101010101




--
Harsh J




--
Harsh J