incubator-ambari-dev mailing list archives

From "Siddharth Wagle (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMBARI-3368) NameNode start hangs with HA config'd
Date Fri, 27 Sep 2013 21:02:03 GMT

    [ https://issues.apache.org/jira/browse/AMBARI-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780392#comment-13780392 ]

Siddharth Wagle commented on AMBARI-3368:
-----------------------------------------

Upon further investigation, we found that the DFS client first tries to connect to the original NN
and, only when that connection times out, tries the other NN.
This will slow down jobs that run after a failover.

{code}
[root@ambari-nn-ha-2 data]# time su - hdfs -c 'hadoop --config /etc/hadoop/conf fs -chown hcat /user/hcat'
13/09/24 14:09:48 DEBUG retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB. Trying to fail over immediately.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
	at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1496)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1029)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3269)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(Na
{code}

Time:
{code}
real	0m3.996s
user	0m2.697s
sys	0m0.147s
{code}
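
The try-the-first-NN-then-the-other behavior described above can be sketched as a toy model. This is only an illustration under stated assumptions, not the Hadoop client code: the names NameNode, StandbyError, call_with_failover and rpc_cost_s are made up for the example, and the per-call costs are hypothetical values rather than measurements.

```python
from dataclasses import dataclass


class StandbyError(Exception):
    """Hypothetical stand-in for the StandbyException seen in the log."""


@dataclass
class NameNode:
    name: str
    active: bool
    rpc_cost_s: float = 0.0  # assumed cost of one attempt, in seconds

    def get_file_info(self, path: str) -> str:
        # A standby NameNode rejects reads, as in the trace above.
        if not self.active:
            raise StandbyError(
                "Operation category READ is not supported in state standby"
            )
        return f"{self.name}:{path}"


def call_with_failover(namenodes, path):
    """Try each NameNode in configured order until one answers.

    Returns (result, attempts, simulated_cost_s). The cost of every
    failed attempt is added to the total, which is the slowdown the
    comment describes.
    """
    attempts = 0
    cost = 0.0
    for nn in namenodes:
        attempts += 1
        cost += nn.rpc_cost_s
        try:
            return nn.get_file_info(path), attempts, cost
        except StandbyError:
            continue  # fail over to the next configured NameNode
    raise RuntimeError("no active NameNode found")
```

In this model, a client whose first configured NN is now the standby pays for one wasted attempt on every call, so the overhead after a failover grows with the cost of that first attempt.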
                
> NameNode start hangs with HA config'd
> -------------------------------------
>
>                 Key: AMBARI-3368
>                 URL: https://issues.apache.org/jira/browse/AMBARI-3368
>             Project: Ambari
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.4.1
>            Reporter: Siddharth Wagle
>            Assignee: Siddharth Wagle
>             Fix For: 1.4.1
>
>
> After configuring NameNode HA, I found that starting a NameNode hangs and then fails with "Puppet has been killed due to timeout".
> 1) Install cluster
> 2) Enable NameNode HA
> 3) Stop the standby NameNode on the Hosts details page
> 4) Stop the active NameNode on the Hosts details page
> 5) Start NameNode on the Hosts details page
> 6) The start hangs at 35% complete. After ~10 minutes, Puppet is killed due to timeout

