From: randy <randysch@comcast.net>
Date: Tue, 15 Jan 2013 20:34:21 -0500
To: user@hadoop.apache.org
Subject: Re: hadoop namenode recovery

What happens to the NN and/or to performance if there's a problem with
the NFS server? Or with the network?

Thanks,
randy

On 01/14/2013 11:36 PM, Harsh J wrote:
> It's very rare to observe an NN crash due to a software bug in
> production. Most of the time it's a hardware fault you should worry
> about.
>
> On 1.x, or any release without the HA features, the best safeguard
> against a total loss is to configure redundant disk volumes for the
> NN metadata, one of them preferably on a dedicated remote NFS mount.
> That way the NN is recoverable after its node goes down: you retrieve
> a current copy of the metadata from another machine (i.e. via the NFS
> mount), set up a new node to replace the old NN, and continue along.
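>
> For illustration, a minimal sketch of such a setup in hdfs-site.xml;
> the paths are placeholders for your own local disk and NFS mount
> points, not values from this thread:
>
>   <!-- The NN keeps a full copy of its fsimage/edits in every
>        directory listed here (dfs.name.dir is the 1.x key). -->
>   <property>
>     <name>dfs.name.dir</name>
>     <!-- one local volume plus a dedicated remote NFS mount;
>          a hard NFS mount that becomes unreachable can block the
>          NN's writes, so soft mounts with short timeouts are the
>          commonly suggested hedge -->
>     <value>/data/1/dfs/nn,/mnt/remote-nfs/dfs/nn</value>
>   </property>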
>
> A load balancer will not work, as the NN is not a simple webserver:
> it maintains state which you cannot sync. We wrote the HA-HDFS
> features to address the very concern you have.
>
> If you want true, painless HA, branch-2 is your best bet at this
> point. The upcoming 2.0.3 release should include the QJM-based HA
> feature, which is painless to set up, very reliable to use (compared
> to the other options), and works on commodity-level hardware; a
> bare-bones sketch of the relevant settings is appended at the bottom
> of this mail. FWIW, we (my team and I) have been supporting several
> users and customers who run the 2.x-based HA in production and other
> types of environments, and it has been very stable in our experience.
> There are also some folks in the community running 2.x-based HDFS,
> with and without HA.
>
> On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper wrote:
>
>     Hello,
>
>     Is there a standard way to prevent a NameNode crash from bringing
>     down a Hadoop cluster? Or, what is the standard or best practice
>     for overcoming the single-point-of-failure problem of Hadoop?
>
>     I am not ready to take chances on a production server with the
>     Hadoop 2.0 alpha release, which claims to have solved the
>     problem. Is there anything else I can do to either prevent the
>     failure, or recover from it in a very short time?
>
>     Thanking you,
>
>     --
>     Regards,
>     Ouch Whisper
>     010101010101
>
> --
> Harsh J
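>
> Appendix: a bare-bones sketch of the QJM-based HA settings in
> hdfs-site.xml. Every name below (the nameservice "mycluster", the
> nn1/nn2 and jn1-jn3 hosts) is a placeholder, not a value from this
> thread:
>
>   <property>
>     <name>dfs.nameservices</name>
>     <value>mycluster</value>
>   </property>
>   <!-- two NNs; one becomes active, the other standby -->
>   <property>
>     <name>dfs.ha.namenodes.mycluster</name>
>     <value>nn1,nn2</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>     <value>nn1.example.com:8020</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>     <value>nn2.example.com:8020</value>
>   </property>
>   <!-- the shared edit log lives on a quorum of JournalNodes
>        instead of a shared NFS filer -->
>   <property>
>     <name>dfs.namenode.shared.edits.dir</name>
>     <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
>   </property>
>   <!-- lets clients find whichever NN is currently active -->
>   <property>
>     <name>dfs.client.failover.proxy.provider.mycluster</name>
>     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>   </property>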