Subject: Re: Can not auto-failover when unplug network interface
From: YouPeng Yang <yypvsxf19870706@gmail.com>
To: user@hadoop.apache.org
Date: Tue, 3 Dec 2013 10:46:47 +0800

Hi Yu

   I think that when the NIC is unplugged, SSH cannot get through because it
cannot connect to the failed active NN. If that is the case, sshfence will
fail.
   Am I right?
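(That is indeed how the stock sshfence behaves: org.apache.hadoop.ha.SshFenceByTcpPort
must positively confirm the fence, and a connect failure such as "No route to
host" counts as an unsuccessful fencing attempt, so the ZKFC will not promote
the standby. The usual workaround is a fallback fence method that always
succeeds. A minimal sketch against the hdfs-site.xml quoted below, untested on
this cluster:

<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- Newline-separated list, tried in order: sshfence runs first; if the
       old active is unreachable, fall through to a shell command that
       always succeeds so the failover can proceed. -->
  <value>sshfence
shell(/bin/true)</value>
</property>

Note the trade-off: shell(/bin/true) does not actually fence anything, so it
is only safe when an unreachable machine can be presumed dead, e.g. because
the shared edits are on QJM, which already prevents a second writer.)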
2013/12/3 YouPeng Yang <yypvsxf19870706@gmail.com>

> Hi Yu
>
>   Thanks for your response.
>   I'm sure my SSH setup is good; SSH from the active NN to the standby NN
> needs no password.
>
>   I attached my config.
>
> ------core-site.xml-----------------
>
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://lklcluster</value>
>     <final>true</final>
>   </property>
>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/home/hadoop/tmp2</value>
>   </property>
> </configuration>
>
> -------hdfs-site.xml----------
>
> <configuration>
>   <property>
>     <name>dfs.namenode.name.dir</name>
>     <value>/home/hadoop/namedir2</value>
>   </property>
>
>   <property>
>     <name>dfs.datanode.data.dir</name>
>     <value>/home/hadoop/datadir2</value>
>   </property>
>
>   <property>
>     <name>dfs.nameservices</name>
>     <value>lklcluster</value>
>   </property>
>
>   <property>
>     <name>dfs.ha.namenodes.lklcluster</name>
>     <value>nn1,nn2</value>
>   </property>
>
>   <property>
>     <name>dfs.namenode.rpc-address.lklcluster.nn1</name>
>     <value>hadoop2:8020</value>
>   </property>
>
>   <property>
>     <name>dfs.namenode.rpc-address.lklcluster.nn2</name>
>     <value>hadoop3:8020</value>
>   </property>
>
>   <property>
>     <name>dfs.namenode.http-address.lklcluster.nn1</name>
>     <value>hadoop2:50070</value>
>   </property>
>
>   <property>
>     <name>dfs.namenode.http-address.lklcluster.nn2</name>
>     <value>hadoop3:50070</value>
>   </property>
>
>   <property>
>     <name>dfs.namenode.shared.edits.dir</name>
>     <value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485/lklcluster</value>
>   </property>
>
>   <property>
>     <name>dfs.client.failover.proxy.provider.lklcluster</name>
>     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>   </property>
>
>   <property>
>     <name>dfs.ha.fencing.methods</name>
>     <value>sshfence</value>
>   </property>
>
>   <property>
>     <name>dfs.ha.fencing.ssh.private-key-files</name>
>     <value>/home/hadoop/.ssh/id_rsa</value>
>   </property>
>
>   <property>
>     <name>dfs.ha.fencing.ssh.connect-timeout</name>
>     <value>5000</value>
>   </property>
>
>   <property>
>     <name>dfs.journalnode.edits.dir</name>
>     <value>/home/hadoop/journal/data</value>
>   </property>
>
>   <property>
>     <name>dfs.ha.automatic-failover.enabled</name>
>     <value>true</value>
>   </property>
>
>   <property>
>     <name>ha.zookeeper.quorum</name>
>     <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
>   </property>
> </configuration>
>
>
> 2013/12/3 Azuryy Yu <azuryyyu@gmail.com>
>
>> This is still because your fence method is configured improperly.
>> Please paste your fence configuration, and double-check that you can SSH
>> from the active NN to the standby NN without a password.
>>
>>
>> On Tue, Dec 3, 2013 at 10:23 AM, YouPeng Yang <yypvsxf19870706@gmail.com> wrote:
>>
>>> Hi
>>>   Another auto-failover testing problem:
>>>
>>>   My HA pair fails over automatically after I kill the active NN. But
>>> when I unplug the network interface to simulate a hardware failure, the
>>> auto-failover does not seem to happen, even after waiting for a while;
>>> the zkfc logs are attached as [1].
>>>
>>>   I'm using the default sshfence.
>>>
>>> [1] zkfc logs ----------------------------------------------------------------
>>> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
>>> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
>>> 2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to hadoop3...
>>> 2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to hadoop3 port 22
>>> 2013-12-03 10:05:59,648 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to connect to hadoop3 as user hadoop
>>> com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route to host
>>>     at com.jcraft.jsch.Util.createSocket(Util.java:386)
>>>     at com.jcraft.jsch.Session.connect(Session.java:182)
>>>     at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
>>>     at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>>>     at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
>>>     at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>     at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>     at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>     at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-12-03 10:05:59,649 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
>>> 2013-12-03 10:05:59,649 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
>>> 2013-12-03 10:05:59,650 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
>>> java.lang.RuntimeException: Unable to fence NameNode at hadoop3/10.7.23.124:8020
>>>     at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:522)
>>>     at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>     at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>     at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>     at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-12-03 10:05:59,650 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
>>> 2013-12-03 10:05:59,676 INFO org.apache.zookeeper.ZooKeeper: Session: 0x142931031810260 closed
>>> 2013-12-03 10:06:00,678 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop1:2181,hadoop2:2181,hadoop3:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5ce2acea
>>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop1/10.7.23.122:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
>>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop1/10.7.23.122:2181, initiating session
>>> 2013-12-03 10:06:00,709 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop1/10.7.23.122:2181, sessionid = 0x142931031810261, negotiated timeout = 5000
>>> 2013-12-03 10:06:00,711 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
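(For reference, the failure above is exactly what the configuration predicts:
SshFenceByTcpPort opens an SSH connection to the old active, gets
NoRouteToHostException because the NIC is down, the fence is recorded as
unsuccessful, and the elector re-joins the election and retries until the
host is reachable again. The fencer's view can be reproduced from the
standby/ZKFC host with a plain ssh probe; a rough sketch, with the user, key,
and 5-second timeout taken from the hdfs-site.xml above, untested:

#!/bin/sh
# Mimic what SshFenceByTcpPort attempts: an SSH login to the old active NN
# (hadoop3) as user 'hadoop' with the configured fencing key. With the NIC
# unplugged this should fail with "No route to host", matching the
# JSchException in the zkfc log.
ssh -i /home/hadoop/.ssh/id_rsa \
    -o ConnectTimeout=5 \
    -o BatchMode=yes \
    hadoop@hadoop3 true \
  && echo "fencing target reachable over SSH" \
  || echo "fencing target unreachable: sshfence alone cannot succeed"

If the probe fails even with the network up, the problem is the SSH setup; if
it fails only with the cable pulled, a fallback fence method as sketched
earlier in the thread is the standard remedy.)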