ambari-user mailing list archives

From Robert Nettleton <rnettle...@hortonworks.com>
Subject Re: NameNode HA -Blueprints - Standby NN failed and Active NN created
Date Wed, 26 Aug 2015 13:35:02 GMT
Hi Anand,

I just tried out a simple HDFS HA deployment (with Ambari 2.1.0), using the HOSTGROUP syntax
for these two properties, and it failed as I expected.

I’m not sure why “dfs_ha_initial_namenode_active” includes the FQDN.  I suspect that
there is some other problem that is causing this.

As I mentioned before, these two properties are not currently meant for %HOSTGROUP% substitution,
so the fix is to specify the FQDNs within these properties.
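
For example, the hadoop-env section of the Blueprint would then look roughly like
this (the hostnames below are only placeholders for your actual NameNode FQDNs):

  "hadoop-env": {
    "dfs_ha_initial_namenode_active": "nn1.example.com",
    "dfs_ha_initial_namenode_standby": "nn2.example.com"
  }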

If you are concerned about including hostnames in your Blueprint for portability reasons,
you can always set these properties in the cluster creation template instead.
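
For instance, a cluster creation template along these lines keeps the host-specific
values out of the Blueprint itself (the blueprint name and hostnames below are just
placeholders, and any additional host groups are omitted):

  {
    "blueprint": "my-ha-blueprint",
    "configurations": [
      {
        "hadoop-env": {
          "dfs_ha_initial_namenode_active": "nn1.example.com",
          "dfs_ha_initial_namenode_standby": "nn2.example.com"
        }
      }
    ],
    "host_groups": [
      { "name": "host_group_master_1", "hosts": [ { "fqdn": "nn1.example.com" } ] },
      { "name": "host_group_master_2", "hosts": [ { "fqdn": "nn2.example.com" } ] }
    ]
  }

You would then POST this template to /api/v1/clusters/<cluster-name> as usual.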

If you don’t need to select the initial state of the namenodes in your cluster, you can
just remove these properties from your Blueprint, and the Blueprint processor will select
an “active” and “standby” namenode.

If it still appears to you that the property is being set by the Blueprints processor, please
feel free to file a JIRA to track the investigation into this.

Hope this helps! 

Thanks,
Bob

On Aug 26, 2015, at 2:29 AM, Anandha L Ranganathan <analog.sony@gmail.com> wrote:

> + dev group.
> 
> 
> This is what I found in the /var/lib/ambari-agent/data/command-#.json in
> the one of the master host.
> In it you can see that the active namenode is substituted with the FQDN, but
> not the standby node. Is this a bug in this version of Ambari?
> 
> I am using *Ambari 2.1*.
> 
>  hadoop-env{
> 
>            "dfs_ha_initial_namenode_active": "usw2ha3dpma01.local",
>            "hadoop_root_logger": "INFO,RFA",
>            "dfs_ha_initial_namenode_standby":
> "%HOSTGROUP::host_group_master_2%",
>            "namenode_opt_permsize": "128m"
> }
> 
> 
> Thanks
> Anand
> 
> 
> On Tue, Aug 25, 2015 at 11:23 AM Anandha L Ranganathan <
> analog.sony@gmail.com> wrote:
> 
>> 
>> Hi
>> 
>> I am trying to set up NameNode HA using Blueprints.
>> During cluster creation through scripts, the following steps complete:
>> 
>> 1) Journal nodes start and are initialized (the journal nodes are formatted).
>> 2) The HA state is initialized in ZooKeeper / ZKFC (on both the active and
>> standby NameNodes).
>> At 96% it fails. I logged into the cluster using the UI and restarted
>> the standby NameNode, but it threw an exception saying the NameNode is not
>> formatted.
>> I had to manually bootstrap the standby (copying over the fsimage) by running
>> "hdfs namenode -bootstrapStandby -force" on the standby NN server;
>> after restarting, the NameNode works fine and goes into standby mode.
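>> 
>> (Roughly, the manual recovery on the standby NN host looked like the
>> following; the standard "hdfs" service user is assumed here:)
>> 
>>   # run as the HDFS service user on the standby NameNode host;
>>   # this copies the current fsimage over from the active NameNode
>>   sudo -u hdfs hdfs namenode -bootstrapStandby -force
>>   # then restart the NameNode (I did this from the Ambari UI)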
>> 
>> Is there something I am missing in the configuration?
>> My NameNode HA blueprint looks like this.
>> 
>> hadoop-env {
>>         "dfs_ha_initial_namenode_active": "%HOSTGROUP::host_group_master_1%",
>>         "dfs_ha_initial_namenode_standby": "%HOSTGROUP::host_group_master_2%"
>> }
>> 
>> 
>> hdfs-site{
>>          "dfs.client.failover.proxy.provider.dfs-nameservices":
>> "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
>>          "dfs.ha.automatic-failover.enabled": "true",
>>          "dfs.ha.fencing.methods": "shell(/bin/true)",
>>          "dfs.ha.namenodes.dfs-nameservices": "nn1,nn2",
>>          "dfs.namenode.http-address.dfs-nameservices.nn1":
>> "%HOSTGROUP::host_group_master_1%:50070",
>>          "dfs.namenode.http-address.dfs-nameservices.nn2":
>> "%HOSTGROUP::host_group_master_2%:50070",
>>          "dfs.namenode.https-address.dfs-nameservices.nn1":
>> "%HOSTGROUP::host_group_master_1%:50470",
>>          "dfs.namenode.https-address.dfs-nameservices.nn2":
>> "%HOSTGROUP::host_group_master_2%:50470",
>>          "dfs.namenode.rpc-address.dfs-nameservices.nn1":
>> "%HOSTGROUP::host_group_master_1%:8020",
>>          "dfs.namenode.rpc-address.dfs-nameservices.nn2":
>> "%HOSTGROUP::host_group_master_2%:8020",
>>          "dfs.namenode.shared.edits.dir":
>> "qjournal://%HOSTGROUP::host_group_master_1%:8485;%HOSTGROUP::host_group_master_2%:8485;%HOSTGROUP::host_group_master_3%:8485/dfs-nameservices",
>>          "dfs.nameservices": "dfs-nameservices"
>> 
>> }
>> 
>> 
>> core-site{
>>          "fs.defaultFS": "hdfs://dfs-nameservices",
>>          "ha.zookeeper.quorum":
>> "%HOSTGROUP::host_group_master_1%:2181,%HOSTGROUP::host_group_master_2%:2181,%HOSTGROUP::host_group_master_3%:2181"
>> 
>> }
>> 
>> 
>> 
>> This is the log output from the standby NameNode server.
>> 
>> 2015-08-25 08:26:26,373 INFO  zookeeper.ZooKeeper
>> (Environment.java:logEnv(100)) - Client
>> environment:user.dir=/usr/hdp/2.2.6.0-2800/hadoop
>> 2015-08-25 08:26:26,380 INFO  zookeeper.ZooKeeper
>> (ZooKeeper.java:<init>(438)) - Initiating client connection,
>> connectString=usw2ha2dpma01.local:2181,usw2ha2dpma02.local:2181,usw2ha2dpma03.local:2181
>> sessionTimeout=5000
>> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5b7a5baa
>> 2015-08-25 08:26:26,399 INFO  zookeeper.ClientCnxn
>> (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to
>> server usw2ha2dpma02.local/172.17.213.51:2181. Will not attempt to
>> authenticate using SASL (unknown error)
>> 2015-08-25 08:26:26,405 INFO  zookeeper.ClientCnxn
>> (ClientCnxn.java:primeConnection(852)) - Socket connection established to
>> usw2ha2dpma02.local/172.17.213.51:2181, initiating session
>> 2015-08-25 08:26:26,413 INFO  zookeeper.ClientCnxn
>> (ClientCnxn.java:onConnected(1235)) - Session establishment complete on
>> server usw2ha2dpma02.local/172.17.213.51:2181, sessionid =
>> 0x24f63f6f3050001, negotiated timeout = 5000
>> 2015-08-25 08:26:26,416 INFO  ha.ActiveStandbyElector
>> (ActiveStandbyElector.java:processWatchEvent(547)) - Session connected.
>> 2015-08-25 08:26:26,441 INFO  ipc.CallQueueManager
>> (CallQueueManager.java:<init>(53)) - Using callQueue class
>> java.util.concurrent.LinkedBlockingQueue
>> 2015-08-25 08:26:26,472 INFO  ipc.Server (Server.java:run(605)) - Starting
>> Socket Reader #1 for port 8019
>> 2015-08-25 08:26:26,520 INFO  ipc.Server (Server.java:run(827)) - IPC
>> Server Responder: starting
>> 2015-08-25 08:26:26,526 INFO  ipc.Server (Server.java:run(674)) - IPC
>> Server listener on 8019: starting
>> 2015-08-25 08:26:27,596 INFO  ipc.Client
>> (Client.java:handleConnectionFailure(859)) - Retrying connect to server:
>> usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry
>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
>> MILLISECONDS)
>> 2015-08-25 08:26:27,615 WARN  ha.HealthMonitor
>> (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying
>> to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020:
>> Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020
>> failed on connection exception: java.net.ConnectException: Connection
>> refused; For more details see:
>> http://wiki.apache.org/hadoop/ConnectionRefused
>> 2015-08-25 08:26:27,616 INFO  ha.HealthMonitor
>> (HealthMonitor.java:enterState(238)) - Entering state SERVICE_NOT_RESPONDING
>> 2015-08-25 08:26:27,616 INFO  ha.ZKFailoverController
>> (ZKFailoverController.java:setLastHealthState(850)) - Local service
>> NameNode at usw2ha2dpma02.local/172.17.213.51:8020 entered state:
>> SERVICE_NOT_RESPONDING
>> 2015-08-25 08:26:27,616 INFO  ha.ZKFailoverController
>> (ZKFailoverController.java:recheckElectability(766)) - Quitting master
>> election for NameNode at usw2ha2dpma02.local/172.17.213.51:8020 and
>> marking that fencing is necessary
>> 2015-08-25 08:26:27,617 INFO  ha.ActiveStandbyElector
>> (ActiveStandbyElector.java:quitElection(354)) - Yielding from election
>> 2015-08-25 08:26:27,621 INFO  zookeeper.ClientCnxn
>> (ClientCnxn.java:run(512)) - EventThread shut down
>> 2015-08-25 08:26:27,621 INFO  zookeeper.ZooKeeper
>> (ZooKeeper.java:close(684)) - Session: 0x24f63f6f3050001 closed
>> 2015-08-25 08:26:29,623 INFO  ipc.Client
>> (Client.java:handleConnectionFailure(859)) - Retrying connect to server:
>> usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry
>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
>> MILLISECONDS)
>> 2015-08-25 08:26:29,624 WARN  ha.HealthMonitor
>> (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying
>> to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020:
>> Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020
>> failed on connection exception: java.net.ConnectException: Connection
>> refused; For more details see:
>> http://wiki.apache.org/hadoop/ConnectionRefused
>> 2015-08-25 08:26:31,626 INFO  ipc.Client
>> (Client.java:handleConnectionFailure(859)) - Retrying connect to server:
>> usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry
>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
>> MILLISECONDS)
>> 2015-08-25 08:26:31,627 WARN  ha.HealthMonitor
>> (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying
>> to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020:
>> Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020
>> failed on connection exception: java.net.ConnectException: Connection
>> refused; For more details see:
>> http://wiki.apache.org/hadoop/ConnectionRefused
>> 2015-08-25 08:26:33,629 INFO  ipc.Client
>> (Client.java:handleConnectionFailure(859)) - Retrying connect to server:
>> usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry
>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
>> MILLISECONDS)
>> 2015-08-25 08:26:33,630 WARN  ha.HealthMonitor
>> (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying
>> to
>> 
>> 

