hadoop-user mailing list archives

From Quentin Ambard <quentin.amb...@gmail.com>
Subject Re: changing ha failover auto conf value
Date Thu, 22 Nov 2012 21:43:35 GMT
Hi,
Here is what I'm doing:

NN1 (active) + ZKFC1
NN2 (standby) + ZKFC2

First I stop the ZKFC1 service, which gives:
NN1 (standby)
NN2 (active) + ZKFC2

Then I kill the active NameNode: kill -9 on the NN2 process.

NN1 stays in standby.
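
To confirm the state of each NameNode after every step above, something like the following can be used. It is only a minimal sketch: it assumes the `hdfs` command is on the PATH, that the NameNode IDs are nn1 and nn2 as in this thread, and that `hdfs haadmin -getServiceState` exits with a non-zero code when the NameNode cannot be reached. The "unreachable" label and the `run` parameter are my own additions for illustration, not part of Hadoop.

```python
import subprocess


def service_state(nn_id, run=subprocess.run):
    """Return the HA state that haadmin reports for one NameNode ID."""
    result = run(
        ["hdfs", "haadmin", "-getServiceState", nn_id],
        capture_output=True,
        text=True,
        timeout=30,
    )
    if result.returncode != 0:
        # Assumption: a failed call means the NameNode did not answer,
        # for example after a kill -9 on its process.
        return "unreachable"
    return result.stdout.strip()


def cluster_states(nn_ids=("nn1", "nn2"), run=subprocess.run):
    """Map each NameNode ID to its reported state."""
    return {nn_id: service_state(nn_id, run) for nn_id in nn_ids}
```

Calling `cluster_states()` after stopping ZKFC1 and again after killing NN2 gives one dictionary per step to compare against the states listed above.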

ZKFC2 log:

2012-11-22 22:23:40,073 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Checking for any old active which needs to be fenced...
2012-11-22 22:23:40,081 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old
node exists:
0a096d79636c757374657212036e6e321a106e733233363833342e6f76682e6e657420d43e28d33e
2012-11-22 22:23:40,082 INFO org.apache.hadoop.ha.ZKFailoverController:
Should fence: NameNode at /nn2:8020
2012-11-22 22:23:40,205 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully transitioned NameNode at /nn2:8020 to standby state without
fencing
2012-11-22 22:23:40,205 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the
local node is the most recent active...
2012-11-22 22:23:40,233 INFO org.apache.hadoop.ha.ZKFailoverController:
Trying to make NameNode at xxxx/nn1:8020 active...
2012-11-22 22:23:40,605 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully transitioned NameNode at xxxx/nn1:8020 to active state
2012-11-22 22:24:14,073 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
xxxx/nn1:8020: Failed on local exception: java.io.IOException: Response is
null.; Host Details : local host is: "xxxx/nn1"; destination host is:
"xxxx":8020;
2012-11-22 22:24:14,074 INFO org.apache.hadoop.ha.HealthMonitor: Entering
state SERVICE_NOT_RESPONDING
2012-11-22 22:24:14,074 INFO org.apache.hadoop.ha.ZKFailoverController:
Local service NameNode at xxxx/nn1:8020 entered state:
SERVICE_NOT_RESPONDING
2012-11-22 22:24:14,074 INFO org.apache.hadoop.ha.ZKFailoverController:
Quitting master election for NameNode at xxxx/nn1:8020 and marking that
fencing is necessary
2012-11-22 22:24:14,074 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Yielding from election
2012-11-22 22:24:14,128 INFO org.apache.zookeeper.ZooKeeper: Session:
0x23b29574aed0014 closed
2012-11-22 22:24:14,128 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x23b29574aed0014
2012-11-22 22:24:14,128 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2012-11-22 22:24:16,129 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn1:8020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
2012-11-22 22:24:16,130 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
xxxx/nn1:8020: Call From xxxx/nn1 to xxxx:8020 failed on connection
exception: java.net.ConnectException: Connection refused; For more details
see:  http://wiki.apache.org/hadoop/ConnectionRefused
2012-11-22 22:24:18,131 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn1:8020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
2012-11-22 22:24:18,131 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
xxxx/nn1:8020: Call From xxxx/nn1 to xxxx:8020 failed on connection
exception: java.net.ConnectException: Connection refused; For more details
see:  http://wiki.apache.org/hadoop/ConnectionRefused
2012-11-22 22:24:20,133 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn1:8020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
2012-11-22 22:24:20,133 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
xxxx/nn1:8020: Call From xxxx/nn1 to xxxx:8020 failed on connection
exception: java.net.ConnectException: Connection refused; For more details
see:  http://wiki.apache.org/hadoop/ConnectionRefused
2012-11-22 22:24:22,135 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn1:8020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
2012-11-22 22:24:22,136 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
xxxx/nn1:8020: Call From xxxx/nn1 to xxxx:8020 failed on connection
exception: java.net.ConnectException: Connection refused; For more details
see:  http://wiki.apache.org/hadoop/ConnectionRefused
...
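
The "Old node exists" value in the log above is a hex dump of the data stored in the breadcrumb znode. The sketch below decodes it as a protobuf message by hand. The meaning of the field numbers (1 = nameservice ID, 2 = NameNode ID, 3 = hostname, 4 = RPC port, 5 = ZKFC port) is an assumption based on my reading of Hadoop's ActiveNodeInfo message and should be checked against the Hadoop source; the function names are made up for this example.

```python
def read_varint(buf, pos):
    """Read one protobuf varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def decode_fields(hex_payload):
    """Decode varint and length-delimited fields into {field_number: value}."""
    buf = bytes.fromhex(hex_payload)
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint (integers such as ports)
            fields[number], pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (strings)
            length, pos = read_varint(buf, pos)
            fields[number] = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return fields


# Value copied from the "Old node exists" log line.
OLD_NODE = (
    "0a096d79636c757374657212036e6e321a106e7332333638"
    "33342e6f76682e6e657420d43e28d33e"
)
info = decode_fields(OLD_NODE)
```

Fields 2 and 4 of `info` come out as nn2 and 8020, which matches the "Should fence: NameNode at /nn2:8020" line that follows in the log.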


NN1 logs:
2012-11-22 22:23:40,109 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services
started for active state
2012-11-22 22:23:40,109 INFO
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 166
2012-11-22 22:23:40,110 INFO
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2
Total time for transactions(ms): 0Number of transactions batched in Syncs:
0 Number of syncs: 1 SyncTimes(ms): 32 125
2012-11-22 22:23:40,182 INFO
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2
Total time for transactions(ms): 0Number of transactions batched in Syncs:
0 Number of syncs: 2 SyncTimes(ms): 85 144
2012-11-22 22:23:40,196 INFO
org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits
file /home/hdfs/dfs/name/current/edits_inprogress_0000000000000000166 ->
/home/hdfs/dfs/name/current/edits_0000000000000000166-0000000000000000167
2012-11-22 22:23:40,196 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services
required for standby state
2012-11-22 22:23:40,198 INFO
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on
active node at /nn2:8020 every 120 seconds.
2012-11-22 22:23:40,199 INFO
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting
standby checkpoint thread...
Checkpointing active NN at nn2:50070
Serving checkpoints at xxxx/nn1:50070
2012-11-22 22:25:40,235 INFO
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log
roll on remote NameNode /nn2:8020
2012-11-22 22:25:41,248 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn2:8020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2012-11-22 22:25:42,258 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn2:8020. Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2012-11-22 22:25:43,268 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn2:8020. Already tried 2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2012-11-22 22:25:44,279 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn2:8020. Already tried 3 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2012-11-22 22:25:45,289 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn2:8020. Already tried 4 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2012-11-22 22:25:46,300 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn2:8020. Already tried 5 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2012-11-22 22:25:47,310 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn2:8020. Already tried 6 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
...
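
The log excerpts above were hard-wrapped by the mail client, so one log record is spread over several lines. A rough way to rejoin them before searching is sketched below. It assumes every record starts with a timestamp of the form "2012-11-22 22:23:40,073"; lines without one (such as the "Checkpointing active NN" output) are simply appended to the previous record, which is a simplification.

```python
import re

# A record starts with "YYYY-MM-DD HH:MM:SS,mmm " as in the excerpts above.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")


def unwrap(lines):
    """Join hard-wrapped continuation lines onto the record they belong to."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if TIMESTAMP.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line.strip()
    return records
```

Passing the pasted lines through `unwrap` gives one string per record, which makes it easier to filter on a class name such as ZKFailoverController.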

Thanks for your help

2012/11/22 Harsh J <harsh@cloudera.com>

> Hi,
>
> Losing a complete node (ZKFC plus NN) with a journal node (QJM)
> configuration shouldn't be causing automatic failover to fail. Could
> you post up both your NameNode and ZKFC logs somewhere we can take a
> look?
>
> On Fri, Nov 23, 2012 at 12:41 AM, Quentin Ambard
> <quentin.ambard@gmail.com> wrote:
> > Hello,
> > I have 2 namenodes in HA mode, running with 3 journal nodes, 3 zookeeper
> > servers and 2 zkfc (one with each namenode)
> >
> > If the server hosting the active namenode and its zkfc goes down, the
> > single remaining zkfc instance can't activate the standby namenode.
> >
> > So I end up with a single namenode in standby mode.
> > I can try to activate it with the following :
> > hdfs haadmin -transitionToActive nn1 --forcemanual
> >
> > But it's recommended to disable automatic failover first to avoid
> > split-brain.
> > To do so, I stop all my namenodes and set the
> > dfs.ha.automatic-failover.enabled property to false.
> >
> > However, restarting the namenode doesn't change this configuration: I'm
> > still getting the same warning while trying to activate the namenode.
> >
> > How can I change this configuration value?
> >
> > Do I really need 3 namenodes to avoid this situation (manual namenode
> > activation), or can I achieve a fully automatic configuration with only
> > 2 namenodes?
> >
> >
> > Thanks for your help
> >
> >
> > --
> > Quentin Ambard
>
>
>
> --
> Harsh J
>



-- 
Quentin Ambard
