Subject: Re: Split brain - is it possible in hadoop?
From: Michael Segel <michael_segel@hotmail.com>
Date: Tue, 19 Jun 2012 07:47:30 -0500
To: common-user@hadoop.apache.org

In your example, you only have one active Name Node. So how would you
encounter a 'split brain' scenario?
Maybe it would be better if you defined what you mean by a split brain?

-Mike

On Jun 18, 2012, at 8:30 PM, hdev ml wrote:

> All hadoop contributors/experts,
>
> I am trying to simulate split brain in our installation. There are a few
> things we want to know
>
> 1. Does data corruption happen?
> 2. If yes to #1, how do we recover from it?
> 3. What corrective steps should we take in this situation, e.g. killing one
> namenode?
>
> To simulate this, I took the following steps.
>
> 1. We already have a healthy test cluster consisting of 4 machines. One
> machine runs the namenode and a datanode, another runs the secondarynamenode
> and a datanode, the 3rd runs the jobtracker and a datanode, and the 4th runs
> just a datanode.
> 2. Copied the hadoop installation folder to a new location on the datanode.
> 3. Kept all configurations the same in the hdfs-site and core-site xmls,
> except that I changed fs.default.name to a different URI.
> 4. The namenode directory, dfs.name.dir, pointed to the same shared
> NFS-mounted directory that the main namenode points to.
>
> I started this standby namenode using the following command:
> bin/hadoop-daemon.sh --config conf --hosts slaves start namenode
>
> It errored out saying that "the directory is already locked", which is
> expected behaviour. The directory has been locked by the original namenode.
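
For reference, the "already locked" behaviour described here is what an exclusive lock on a file inside the storage directory gives you. The following is only a minimal Java sketch of that general technique, not Hadoop's actual code: the class name DirLockSketch and the helper tryLockDir are hypothetical, and the lock file name in_use.lock is used purely for illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirLockSketch {
    /**
     * Hypothetical helper: tries to take an exclusive lock on dir/in_use.lock.
     * Returns the lock, or null if someone else already holds it.
     */
    static FileLock tryLockDir(Path dir) throws IOException {
        Path lockFile = dir.resolve("in_use.lock"); // illustrative file name
        FileChannel ch = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        try {
            // Returns null when another process holds the lock.
            FileLock lock = ch.tryLock();
            if (lock == null) {
                ch.close();
            }
            return lock;
        } catch (OverlappingFileLockException e) {
            // Thrown when this same JVM already holds a lock on the file.
            ch.close();
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("name-dir");
        FileLock first = tryLockDir(dir);   // plays the original namenode
        FileLock second = tryLockDir(dir);  // plays the second namenode
        System.out.println("first acquired:  " + (first != null));
        System.out.println("second acquired: " + (second != null));
        if (first != null) {
            first.channel().close(); // closing the channel releases the lock
        }
    }
}
```

In this sketch the second attempt on the same directory is refused while the first lock is held, which mirrors the error reported above. How reliably such locks behave on an NFS mount depends on the NFS setup, so treat this as an illustration rather than a guarantee.
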
>
> So I changed dfs.name.dir to some other folder and issued the same
> command. It fails with the message "namenode has not been formatted", which
> is also expected.
>
> This makes me think: does a split-brain situation really occur in hadoop?
>
> My understanding is that split brain happens because of timeouts on the
> main namenode. When the timeout occurs, the HA implementation (be it
> Linux HA, Veritas, etc.) thinks that the main namenode has died and tries
> to start the standby namenode. The standby namenode starts up, and then the
> main namenode comes back from the timeout phase and starts functioning as
> if nothing happened, giving rise to 2 namenodes in the cluster: split brain.
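
The sequence described above can be sketched as a toy simulation. This is a hedged illustration only: the class SplitBrainSketch, the Node type, the monitor method, and the 5-second timeout are all hypothetical and do not model Linux HA, Veritas, or Hadoop itself. It only shows how a timeout-based monitor that promotes a standby without stopping the old primary ends up with two active nodes.

```java
public class SplitBrainSketch {
    /** Hypothetical stand-in for a namenode as seen by an HA monitor. */
    static class Node {
        final String name;
        boolean active;
        long lastHeartbeatMs;

        Node(String name, boolean active) {
            this.name = name;
            this.active = active;
        }
    }

    /** Promotes the standby if the primary's last heartbeat is older than timeoutMs. */
    static void monitor(Node primary, Node standby, long nowMs, long timeoutMs) {
        if (nowMs - primary.lastHeartbeatMs > timeoutMs) {
            // The monitor assumes the primary is dead. It does not stop it.
            standby.active = true;
        }
    }

    /** Counts how many of the given nodes currently consider themselves active. */
    static int activeCount(Node... nodes) {
        int n = 0;
        for (Node node : nodes) {
            if (node.active) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        Node primary = new Node("nn1", true);
        Node standby = new Node("nn2", false);
        long timeoutMs = 5000; // assumed timeout, for illustration

        primary.lastHeartbeatMs = 0; // last heartbeat seen at t=0
        monitor(primary, standby, 3000, timeoutMs);
        System.out.println("t=3s active nodes: " + activeCount(primary, standby));

        // The primary stalls and sends no heartbeat until t=8s.
        monitor(primary, standby, 8000, timeoutMs);
        // The primary resumes, unaware that it was replaced.
        primary.lastHeartbeatMs = 8000;
        System.out.println("t=8s active nodes: " + activeCount(primary, standby));
    }
}
```

The point of the sketch is that nothing in the promotion step deactivates the old primary, so once it resumes both nodes consider themselves active.
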
>
> Considering the error messages and the above understanding, I cannot point
> 2 different namenodes to the same directory, because the main namenode isn't
> responding but has locked the directory.
>
> So can I safely conclude that split brain does not occur in hadoop?
>
> Or am I missing any other situation where split brain happens and the
> namenode directory is not locked, thus allowing the standby namenode also
> to start up?
>
> Has anybody encountered this?
>
> Any help is really appreciated.
>
> Harshad