ambari-user mailing list archives

From Anandha L Ranganathan <analog.s...@gmail.com>
Subject NameNode HA -Blueprints - Standby NN failed and Active NN created
Date Tue, 25 Aug 2015 18:23:20 GMT
Hi

I am trying to install NameNode HA (Active/Standby) using blueprints.
During cluster creation through scripts, the following steps run and
complete:

1) The JournalNodes start and are initialized (the JournalNodes are formatted).
2) The HA state is initialized in ZooKeeper (ZKFC) on both the Active and
Standby NameNodes.
After 96% it fails. I logged into the cluster through the UI and restarted
the standby NameNode, but it threw an exception saying that the NameNode is
not formatted.
I have to manually copy the fsimage and logs by running the command "hdfs
namenode -bootstrapStandby -force" on the standby NN server.
After that, restarting the NameNode works fine and it goes into standby mode.
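
The manual workaround described above could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, and it assumes the `hdfs` client is on the PATH of the standby host and that the script is run as a user allowed to write the NameNode directories.

```python
import shutil
import subprocess


def bootstrap_standby_command(force=True):
    """Build the argv for the bootstrap command quoted above."""
    cmd = ["hdfs", "namenode", "-bootstrapStandby"]
    if force:
        cmd.append("-force")
    return cmd


def run_bootstrap_standby(force=True):
    """Run the command on the standby host.

    Returns the process exit code, or None if `hdfs` is not installed here.
    """
    if shutil.which("hdfs") is None:
        return None
    result = subprocess.run(bootstrap_standby_command(force), check=False)
    return result.returncode


if __name__ == "__main__":
    # Only print the command; call run_bootstrap_standby() to execute it.
    print(" ".join(bootstrap_standby_command()))
```

The standby NameNode would still need to be restarted afterwards, as in the manual procedure.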

Is there something I am missing in the configuration?
My NameNode HA blueprint looks like this.

hadoop-env{
 "dfs_ha_initial_namenode_active": "%HOSTGROUP::host_group_master_1%"
"dfs_ha_initial_namenode_standby": "%HOSTGROUP::host_group_master_2"
}
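
Every host reference in the blueprint relies on the `%HOSTGROUP::name%` token form, so a small check for tokens with a missing delimiter may be worth running before submitting. A minimal sketch, assuming tokens always take that form and that group names contain only word characters and hyphens; the function name is hypothetical and this is not part of Ambari:

```python
import re

# A well-formed token looks like %HOSTGROUP::<name>%
WELL_FORMED = re.compile(r"%HOSTGROUP::[\w-]+%")


def malformed_hostgroup_tokens(value):
    """Return fragments of `value` that mention HOSTGROUP:: but are not
    enclosed in a pair of percent signs."""
    # Strip every well-formed token, then look for leftovers.
    remainder = WELL_FORMED.sub("", value)
    return re.findall(r"%?HOSTGROUP::[\w-]*", remainder)


if __name__ == "__main__":
    for v in ("%HOSTGROUP::host_group_master_1%",
              "%HOSTGROUP::host_group_master_2"):
        print(v, "->", malformed_hostgroup_tokens(v))
```

Run against the two values as pasted above, the check passes the active entry and flags the standby entry, which has no closing `%` (possibly just a copy-paste slip in this mail).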


hdfs-site{
          "dfs.client.failover.proxy.provider.dfs-nameservices": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
          "dfs.ha.automatic-failover.enabled": "true",
          "dfs.ha.fencing.methods": "shell(/bin/true)",
          "dfs.ha.namenodes.dfs-nameservices": "nn1,nn2",
          "dfs.namenode.http-address.dfs-nameservices.nn1": "%HOSTGROUP::host_group_master_1%:50070",
          "dfs.namenode.http-address.dfs-nameservices.nn2": "%HOSTGROUP::host_group_master_2%:50070",
          "dfs.namenode.https-address.dfs-nameservices.nn1": "%HOSTGROUP::host_group_master_1%:50470",
          "dfs.namenode.https-address.dfs-nameservices.nn2": "%HOSTGROUP::host_group_master_2%:50470",
          "dfs.namenode.rpc-address.dfs-nameservices.nn1": "%HOSTGROUP::host_group_master_1%:8020",
          "dfs.namenode.rpc-address.dfs-nameservices.nn2": "%HOSTGROUP::host_group_master_2%:8020",
          "dfs.namenode.shared.edits.dir": "qjournal://%HOSTGROUP::host_group_master_1%:8485;%HOSTGROUP::host_group_master_2%:8485;%HOSTGROUP::host_group_master_3%:8485/dfs-nameservices",
          "dfs.nameservices": "dfs-nameservices"
}


core-site{
          "fs.defaultFS": "hdfs://dfs-nameservices",
          "ha.zookeeper.quorum": "%HOSTGROUP::host_group_master_1%:2181,%HOSTGROUP::host_group_master_2%:2181,%HOSTGROUP::host_group_master_3%:2181"
}
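
The hdfs-site keys above follow the usual HDFS HA naming pattern, in which the nameservice ID and the NameNode IDs are embedded in the property names. A rough cross-check that every listed ID has its matching address keys might look like the sketch below. The function name is hypothetical, and the set of keys it checks is an assumption limited to the ones used in this blueprint:

```python
def missing_ha_keys(hdfs_site):
    """Return hdfs-site keys implied by dfs.nameservices and
    dfs.ha.namenodes.<ns> that are absent from the dict."""
    missing = []
    services = [s.strip() for s in hdfs_site.get("dfs.nameservices", "").split(",")]
    for ns in filter(None, services):
        proxy_key = "dfs.client.failover.proxy.provider." + ns
        if proxy_key not in hdfs_site:
            missing.append(proxy_key)
        nn_key = "dfs.ha.namenodes." + ns
        if nn_key not in hdfs_site:
            missing.append(nn_key)
            continue
        for nn in (n.strip() for n in hdfs_site[nn_key].split(",")):
            for prefix in ("dfs.namenode.rpc-address", "dfs.namenode.http-address"):
                key = "{}.{}.{}".format(prefix, ns, nn)
                if key not in hdfs_site:
                    missing.append(key)
    return missing


if __name__ == "__main__":
    # A cut-down copy of the hdfs-site fragment above, with nn2's RPC address omitted.
    sample = {
        "dfs.nameservices": "dfs-nameservices",
        "dfs.ha.namenodes.dfs-nameservices": "nn1,nn2",
        "dfs.client.failover.proxy.provider.dfs-nameservices": "x",
        "dfs.namenode.rpc-address.dfs-nameservices.nn1": "h1:8020",
        "dfs.namenode.http-address.dfs-nameservices.nn1": "h1:50070",
        "dfs.namenode.http-address.dfs-nameservices.nn2": "h2:50070",
    }
    print(missing_ha_keys(sample))
```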



This is the log from the Standby NameNode server.

2015-08-25 08:26:26,373 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.dir=/usr/hdp/2.2.6.0-2800/hadoop
2015-08-25 08:26:26,380 INFO  zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=usw2ha2dpma01.local:2181,usw2ha2dpma02.local:2181,usw2ha2dpma03.local:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5b7a5baa
2015-08-25 08:26:26,399 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server usw2ha2dpma02.local/172.17.213.51:2181. Will not attempt to authenticate using SASL (unknown error)
2015-08-25 08:26:26,405 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to usw2ha2dpma02.local/172.17.213.51:2181, initiating session
2015-08-25 08:26:26,413 INFO  zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session establishment complete on server usw2ha2dpma02.local/172.17.213.51:2181, sessionid = 0x24f63f6f3050001, negotiated timeout = 5000
2015-08-25 08:26:26,416 INFO  ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(547)) - Session connected.
2015-08-25 08:26:26,441 INFO  ipc.CallQueueManager (CallQueueManager.java:<init>(53)) - Using callQueue class java.util.concurrent.LinkedBlockingQueue
2015-08-25 08:26:26,472 INFO  ipc.Server (Server.java:run(605)) - Starting Socket Reader #1 for port 8019
2015-08-25 08:26:26,520 INFO  ipc.Server (Server.java:run(827)) - IPC Server Responder: starting
2015-08-25 08:26:26,526 INFO  ipc.Server (Server.java:run(674)) - IPC Server listener on 8019: starting
2015-08-25 08:26:27,596 INFO  ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-08-25 08:26:27,615 WARN  ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020: Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2015-08-25 08:26:27,616 INFO  ha.HealthMonitor (HealthMonitor.java:enterState(238)) - Entering state SERVICE_NOT_RESPONDING
2015-08-25 08:26:27,616 INFO  ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(850)) - Local service NameNode at usw2ha2dpma02.local/172.17.213.51:8020 entered state: SERVICE_NOT_RESPONDING
2015-08-25 08:26:27,616 INFO  ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(766)) - Quitting master election for NameNode at usw2ha2dpma02.local/172.17.213.51:8020 and marking that fencing is necessary
2015-08-25 08:26:27,617 INFO  ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(354)) - Yielding from election
2015-08-25 08:26:27,621 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
2015-08-25 08:26:27,621 INFO  zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x24f63f6f3050001 closed
2015-08-25 08:26:29,623 INFO  ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-08-25 08:26:29,624 WARN  ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020: Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2015-08-25 08:26:31,626 INFO  ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-08-25 08:26:31,627 WARN  ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020: Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2015-08-25 08:26:33,629 INFO  ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-08-25 08:26:33,630 WARN  ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to
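
The repeated "Connection refused" entries in the log above indicate that nothing was accepting connections on the NameNode RPC port (8020) of this host when the health monitor probed it. One quick way to confirm that from the host itself is a plain TCP probe. A minimal sketch with a hypothetical helper name; the host and port in the usage line are taken from the log:

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within
    `timeout` seconds, False on refusal, timeout, or resolution failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Host and port as they appear in the log above.
    print(port_is_open("usw2ha2dpma02.local", 8020))
```

A False result only shows that the port is not reachable; it does not say why the NameNode process is not listening.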
