From: Quentin Ambard
Date: Thu, 22 Nov 2012 22:43:35 +0100
Subject: Re: changing ha failover auto conf value
To: user
Reply-To: user@hadoop.apache.org
Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=e89a8fb1f3be1eda8204cf1c5db6

--e89a8fb1f3be1eda8204cf1c5db6
Content-Type: text/plain; charset=ISO-8859-1

Hi

Here is what I'm doing:

NN1 (active) + ZKFC1
NN2 (standby) + ZKFC2

First I stop the ZKFC1 service =>
NN1 (standby)
NN2 (active) + ZKFC2

Then I kill the active node: kill -9 on the NN2 process.
NN1 stays in standby.

ZKFC2 log:

2012-11-22 22:23:40,073 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2012-11-22 22:23:40,081 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a096d79636c757374657212036e6e321a106e733233363833342e6f76682e6e657420d43e28d33e
2012-11-22 22:23:40,082 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at /nn2:8020
2012-11-22 22:23:40,205 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at /nn2:8020 to standby state without fencing
2012-11-22 22:23:40,205 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the local node is the most recent active...
2012-11-22 22:23:40,233 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at xxxx/nn1:8020 active...
2012-11-22 22:23:40,605 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at xxxx/nn1:8020 to active state
2012-11-22 22:24:14,073 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at xxxx/nn1:8020: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "xxxx/nn1"; destination host is: "xxxx":8020;
2012-11-22 22:24:14,074 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2012-11-22 22:24:14,074 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at xxxx/nn1:8020 entered state: SERVICE_NOT_RESPONDING
2012-11-22 22:24:14,074 INFO org.apache.hadoop.ha.ZKFailoverController: Quitting master election for NameNode at xxxx/nn1:8020 and marking that fencing is necessary
2012-11-22 22:24:14,074 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2012-11-22 22:24:14,128 INFO org.apache.zookeeper.ZooKeeper: Session: 0x23b29574aed0014 closed
2012-11-22 22:24:14,128 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x23b29574aed0014
2012-11-22 22:24:14,128 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2012-11-22 22:24:16,129 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xxxx/nn1:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
2012-11-22 22:24:16,130 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at xxxx/nn1:8020: Call From xxxx/nn1 to xxxx:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2012-11-22 22:24:18,131 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xxxx/nn1:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
2012-11-22 22:24:18,131 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at xxxx/nn1:8020: Call From xxxx/nn1 to xxxx:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2012-11-22 22:24:20,133 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xxxx/nn1:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
2012-11-22 22:24:20,133 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at xxxx/nn1:8020: Call From xxxx/nn1 to xxxx:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2012-11-22 22:24:22,135 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xxxx/nn1:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
2012-11-22 22:24:22,136 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at xxxx/nn1:8020: Call From xxxx/nn1 to xxxx:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
...

NN1 logs:

2012-11-22 22:23:40,109 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for active state
2012-11-22 22:23:40,109 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 166
2012-11-22 22:23:40,110 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 32 125
2012-11-22 22:23:40,182 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 85 144
2012-11-22 22:23:40,196 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /home/hdfs/dfs/name/current/edits_inprogress_0000000000000000166 -> /home/hdfs/dfs/name/current/edits_0000000000000000166-0000000000000000167
2012-11-22 22:23:40,196 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state
2012-11-22 22:23:40,198 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at /nn2:8020 every 120 seconds.
2012-11-22 22:23:40,199 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...
Checkpointing active NN at nn2:50070
Serving checkpoints at xxxx/nn1:50070
2012-11-22 22:25:40,235 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode /nn2:8020
2012-11-22 22:25:41,248 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xxxx/nn2:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2012-11-22 22:25:42,258 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xxxx/nn2:8020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2012-11-22 22:25:43,268 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xxxx/nn2:8020. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2012-11-22 22:25:44,279 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xxxx/nn2:8020. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2012-11-22 22:25:45,289 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xxxx/nn2:8020. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2012-11-22 22:25:46,300 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xxxx/nn2:8020. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2012-11-22 22:25:47,310 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xxxx/nn2:8020. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
...

Thanks for your help

2012/11/22 Harsh J

> Hi,
>
> Losing a complete node (ZKFC plus NN) with a journal node (QJM)
> configuration shouldn't be causing automatic failover to fail. Could
> you post up both your NameNode and ZKFC logs somewhere we can take a
> look?
>
> On Fri, Nov 23, 2012 at 12:41 AM, Quentin Ambard wrote:
> > Hello,
> > I have 2 namenodes in ha mode, running with 3 journal node, 3 zookeeper
> > servers and 2 zkfc (one with each namenode)
> >
> > If a server with the activated namenode and a zkfc get both down, the single
> > instance of zkfc can't activate the standby namenode.
> >
> > So I end with a single namenode in standby mode.
> > I can try to activate it with the following :
> > hdfs haadmin -transitionToActive nn1 --forcemanual
> >
> > But it's recommended to disable the automatic failover to avoid split-brain.
> > To do so, i stop all my namenode and set the
> > dfs.ha.automatic-failover.enabled property to false.
> >
> > However, restarting the namenode doesn't change this configuration, i'm
> > still getting the same warning while trying to activate the namenode.
> >
> > How can I change this configuration value ?
> >
> > Do I really need to have 3 namenode to avoid this situation (namenode
> > manually activation), or can I achieve a full-auto conf with only 2 namenode ?
> >
> >
> > Thanks for your help
> >
> >
> > --
> > Quentin Ambard
> >
>
> --
> Harsh J
>

--
Quentin Ambard

--e89a8fb1f3be1eda8204cf1c5db6
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hi
--e89a8fb1f3be1eda8204cf1c5db6--