From: Azuryy
To: "user@hadoop.apache.org"
Subject: Re: HA NN Failover question
Date: Sat, 15 Mar 2014 11:35:20 +0800

I assume NN2 is the standby. Please check that ZKFC2 is alive before you stop the network on NN1.
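
One way to run that check on the NN2 host is to look for the ZKFC process in `jps` output, where the daemon is listed as `DFSZKFailoverController`. The following is only a minimal Python sketch: the helper names are hypothetical, and it assumes `jps` is on the PATH of the user running it. `hdfs haadmin -getServiceState <serviceId>` can additionally confirm which NameNode is currently the standby.

```python
import subprocess

# Class name that `jps` prints for the ZKFC daemon.
ZKFC_CLASS = "DFSZKFailoverController"


def zkfc_running(jps_output: str) -> bool:
    """Return True if a ZKFC process appears in the given `jps` output."""
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line looks like "<pid> <main class name>".
        if len(parts) >= 2 and parts[1] == ZKFC_CLASS:
            return True
    return False


def check_local_zkfc() -> bool:
    """Run `jps` on this host (hypothetical helper) and look for the ZKFC."""
    result = subprocess.run(
        ["jps"], capture_output=True, text=True, check=True
    )
    return zkfc_running(result.stdout)
```

Run on the standby server, `check_local_zkfc()` should return `True` before the network is taken down on NN1.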


On March 15, 2014, at 10:53, dlmarion <dlmarion@hotmail.com> wrote:
Apache Hadoop 2.3.0




-------- Original message --------
From: Azuryy
Date: 03/14/2014 10:45 PM (GMT-05:00)
To: user@hadoop.apache.org
Subject: Re: HA NN Failover question

Which Hadoop version are you using?



On March 15, 2014, at 9:29, dlmarion <dlmarion@hotmail.com> wrote:

Server 1: NN1 and ZKFC1
Server 2: NN2 and ZKFC2
Server 3: Journal1 and ZK1
Server 4: Journal2 and ZK2
Server 5: Journal3 and ZK3
Server 6+: Datanode

 

All in the same rack. I would expect the ZKFC on the active NameNode server to lose its lock, and the other ZKFC to tell the standby NameNode that it should become active (I’m assuming that’s how it works).

 

- Dave
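
The lock hand-off Dave describes can be illustrated with a toy model. This is a sketch under stated assumptions, not Hadoop's implementation: `ToyZooKeeper` and its methods are hypothetical stand-ins for an ephemeral lock held through a ZooKeeper session. The real ZKFC also has to fence the old active NameNode before promoting the standby, which this model deliberately leaves out.

```python
class ToyZooKeeper:
    """Hypothetical stand-in for a ZooKeeper ensemble holding one ephemeral lock."""

    def __init__(self):
        self.lock_holder = None  # session name that currently owns the lock

    def try_acquire(self, session: str) -> bool:
        """Take the lock if it is free; report whether `session` holds it."""
        if self.lock_holder is None:
            self.lock_holder = session
        return self.lock_holder == session

    def expire(self, session: str) -> None:
        """Model a session timeout: the ephemeral lock disappears with it."""
        if self.lock_holder == session:
            self.lock_holder = None


zk = ToyZooKeeper()
zkfc1_has_lock = zk.try_acquire("zkfc1")    # ZKFC1 wins the election first
zkfc2_has_lock = zk.try_acquire("zkfc2")    # ZKFC2 cannot take a held lock
zk.expire("zkfc1")                          # NN1's server drops off the network
zkfc2_after_expiry = zk.try_acquire("zkfc2")  # ZKFC2 can now take the lock
```

In this model the standby side gets the lock as soon as the active side's session expires, which matches the expectation stated above.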

 

From: Juan Carlos [mailto:jucaf1@gmail.com]
Sent: Friday, March 14, 2014 9:12 PM
To: user@hadoop.apache.org
Subject: Re: HA NN Failover question

 

Hi Dave,

How many ZooKeeper servers do you have, and where are they?


Juan Carlos Fernández Rodríguez

On 15/03/2014, at 01:21, dlmarion wrote:

I was doing some testing with HA NN today. I set up two NNs with active failover (ZKFC) using sshfence. I tested that it was working on both NNs by doing ‘kill -9 <pid>’ on the active NN. When I did this on the active node, the standby would become the active and everything seemed to work. Next, I logged onto the active NN and did a ‘service network stop’ to simulate a NIC/network failure. The standby did not become the active in this scenario. In fact, it remained in standby mode and complained in the log that it could not communicate with (what was) the active NN. I was unable to find anything relevant via searches in Google or Jira. Does anyone have experience successfully testing this? I’m hoping that it is just a configuration problem.

 

FWIW, when the network was restarted on the active NN, it failed over almost immediately.

 

Thanks,

 

Dave
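
One possible explanation for both observations, not confirmed anywhere in this thread, is fencing. Hadoop tries the methods listed in `dfs.ha.fencing.methods` in order, and the standby is promoted only once one of them reports success. `sshfence` needs to SSH into the old active NameNode, so it cannot succeed while that host is off the network. The sketch below is a toy model of that ordering; all function names are hypothetical, and `always_succeed` merely stands in for a fallback entry such as `shell(/bin/true)`, which gives up any real fencing guarantee.

```python
from typing import Callable, List


def attempt_failover(fencing_methods: List[Callable[[], bool]]) -> str:
    """Try each fencing method in order; promote only if one succeeds."""
    for fence in fencing_methods:
        if fence():
            return "active"   # old active is fenced, standby may take over
    return "standby"          # every method failed, so failover is refused


def make_sshfence(old_active_reachable: bool) -> Callable[[], bool]:
    """Toy sshfence: it can only succeed if SSH to the old active works."""
    return lambda: old_active_reachable


def always_succeed() -> bool:
    """Stand-in for a fallback method that always reports success."""
    return True


# kill -9 on the NameNode: the host is still reachable over SSH.
after_kill = attempt_failover([make_sshfence(True)])
# service network stop: SSH to the old active cannot connect.
after_network_stop = attempt_failover([make_sshfence(False)])
# Same outage, but with a fallback method listed after sshfence.
with_fallback = attempt_failover([make_sshfence(False), always_succeed])
```

Under these assumptions the model reproduces the behaviour described: failover proceeds after `kill -9`, stalls while the network is down, and goes through once the old active is reachable again.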
