Subject: Re: Split brain - is it possible in hadoop?
From: Michael Segel
Date: Tue, 19 Jun 2012 07:47:30 -0500
To: common-user@hadoop.apache.org

In your example, you only have one active Name Node. So how would you encounter a 'split brain' scenario?
Maybe it would be better if you defined what you mean by a split brain?

-Mike

On Jun 18, 2012, at 8:30 PM, hdev ml wrote:

> All hadoop contributors/experts,
>
> I am trying to simulate split brain in our installation. There are a few
> things we want to know
>
> 1. Does data corruption happen?
> 2. If Yes in #1, how to recover from it.
> 3. What are the corrective steps to take in this situation e.g. killing one
> namenode etc
>
> So to simulate this I took following steps.
>
> 1. We already have a healthy test cluster, consisting of 4 machines. One
> machine runs namenode and a datanode, other machine runs secondarynamenode
> and a datanode, 3rd runs jobtracker and a datanode, and 4th one just a
> datanode.
> 2. Copied the hadoop installation folder to a new location in the datanode.
> 3. Kept all configurations same in hdfs-site and core-site xmls, except
> renamed the fs.default.name to a different URI
> 4. The namenode directory - dfs.name.dir was pointing to the same shared
> NFS mounted directory to which the main namenode points to.
>
> I started this standby namenode using following command
> bin/hadoop-daemon.sh --config conf --hosts slaves start namenode
>
> It errored out saying that "the directory is already locked", which is an
> expected behaviour. The directory has been locked by the original namenode.
>
> So I changed the dfs.name.dir to some other folder, and issued the same
> command. It fails with message - "namenode has not been formatted", which
> is also expected.
>
> This makes me think - does splitbrain situation really occur in hadoop?
>
> My understanding is that split brain happens because of timeouts on the
> main namenode. The way it happens is, when the timeout occurs, the HA
> implementation - Be it Linux HA, Veritas etc., thinks that the main
> namenode has died and tries to start the standby namenode. The standby
> namenode starts up and then main namenode comes back from the timeout phase
> and starts functioning as if nothing happened, giving rise to 2 namenodes
> in the cluster - Split Brain.
>
> Considering the error messages and the above understanding, I cannot point
> 2 different namenodes to same directory, because the main namenode isn't
> responding but has locked the directory.
>
> So can I safely conclude that split brain does not occur in hadoop?
>
> Or am I missing any other situation where split brain happens and the
> namenode directory is not locked, thus allowing the standby namenode also
> to start up?
>
> Has anybody encountered this?
>
> Any help is really appreciated.
>
> Harshad
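
For reference, the "directory is already locked" behaviour described in the quoted message can be sketched in plain Java. This is only an illustration of an exclusive lock on a marker file inside a storage directory; it is not the NameNode's actual code. The class name NameDirLockSketch, the method tryLockDir and the marker file name are assumptions made up for this example, and both "namenodes" here are just two attempts inside one JVM on a local temp directory. Whether such a lock is honoured between two hosts sharing an NFS mount depends on how locking is set up on that mount, which this sketch does not cover.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;

public class NameDirLockSketch {

    /**
     * Tries to take an exclusive lock on a marker file inside dir.
     * Returns the lock on success, or null if the directory is already locked.
     * (Hypothetical helper; the marker file name is illustrative.)
     */
    static FileLock tryLockDir(File dir) throws IOException {
        File marker = new File(dir, "in_use.lock");
        RandomAccessFile raf = new RandomAccessFile(marker, "rws");
        FileLock lock = null;
        try {
            // Non-blocking: returns null if another process holds the lock.
            lock = raf.getChannel().tryLock();
        } catch (OverlappingFileLockException e) {
            // Thrown when this same JVM already holds a lock on the file.
        }
        if (lock == null) {
            // Caveat from the FileLock documentation: on some platforms,
            // closing any channel on the file may drop locks this JVM holds
            // on it. A real implementation would need to account for that.
            raf.close();
        }
        return lock;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("namedir").toFile();

        // First "namenode" claims the directory.
        FileLock first = tryLockDir(dir);
        // Second "namenode" pointed at the same directory is refused.
        FileLock second = tryLockDir(dir);

        System.out.println("first acquired:  " + (first != null));
        System.out.println("second acquired: " + (second != null));

        // Release the lock and close the channel that owns it.
        first.release();
        first.channel().close();
    }
}
```

The point of the sketch is narrow: a lock like this only stops a second process from starting on the same directory while the first holder is alive and the lock is visible to both. It says nothing about two namenodes that each have their own directory.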