From: Takahiko Kawasaki <daru...@gmail.com>
Subject: DataNodes fail to send heartbeat to HA-enabled NameNode
Date: Tue, 30 Oct 2012 11:10:44 GMT
Hello,

I am having trouble with quorum-based HDFS HA on CDH 4.1.1.

The NameNode Web UI of Cloudera Manager reports the NameNode status.
It has a "Cluster Summary" section, and my cluster is summarized
there as below.

--- Cluster Summary ---
Configured Capacity   : 0 KB
DFS Used              : 0 KB
Non DFS Used          : 0 KB
DFS Remaining         : 0 KB
DFS Used%             : 100 %
DFS Remaining%        : 0 %
Block Pool Used       : 0 KB
Block Pool Used%      : 100 %
DataNodes usages      : Min 0 %  Median 0 %  Max 0 %  stdev 0 %
Live Nodes            : 0 (Decommissioned: 0)
Dead Nodes            : 5 (Decommissioned: 0)
Decommissioning Nodes : 0
--------------------

As you can see, all the DataNodes are regarded as dead.

I found that the DataNodes kept emitting log messages about failing
to send heartbeats to the NameNode.

--- DataNode Log (host names were manually edited) ---
2012-10-30 19:28:16,817 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode
node02.example.com/192.168.62.232:8020 using DELETEREPORT_INTERVAL of
300000 msec  BLOCKREPORT_INTERVAL of 21600000msec Initial delay:
0msec; heartBeatInterval=3000
2012-10-30 19:28:16,817 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
BPOfferService for Block pool
BP-2063217961-192.168.62.231-1351263110470 (storage id
DS-2090122187-192.168.62.233-50010-1338981658216) service to
node02.example.com/192.168.62.232:8020
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:435)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:674)
        at java.lang.Thread.run(Thread.java:662)
--------------------

So I guess that the DataNodes are failing to locate the name service
for some reason, but I don't have any clue how to solve the problem.
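
For what it's worth, the same dead-node count can probably also be
checked from the command line. This is an untested sketch; I believe
"hdfs dfsadmin -report" lists the live and dead DataNodes as seen by
the active NameNode, and "-safemode get" shows whether the NameNode
is still in safe mode, which I would expect while no DataNodes are
registered.

--- diagnostic commands (untested sketch) ---
# sudo -u hdfs hdfs dfsadmin -report        (list live/dead DataNodes)
# sudo -u hdfs hdfs dfsadmin -safemode get  (check safe mode status)
--------------------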

I confirmed that
/var/run/cloudera-scm-agent/process/???-hdfs-DATANODE/core-site.xml
on a DataNode contains

--- core-site.xml ---
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameservice1</value>
  </property>
--------------------

and hdfs-site.xml contains

--- hdfs-site.xml ---
  <property>
    <name>dfs.nameservices</name>
    <value>nameservice1</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.nameservice1</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.nameservice1</name>
    <value>namenode38,namenode90</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.namenode38</name>
    <value>node01.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.nameservice1.namenode38</name>
    <value>node01.example.com:50070</value>
  </property>
  <property>
    <name>dfs.namenode.https-address.nameservice1.namenode38</name>
    <value>node01.example.com:50470</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.namenode90</name>
    <value>node02.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.nameservice1.namenode90</name>
    <value>node02.example.com:50070</value>
  </property>
  <property>
    <name>dfs.namenode.https-address.nameservice1.namenode90</name>
    <value>node02.example.com:50470</value>
  </property>
  <property>
    <name>dfs.permissions.superusergroup</name>
    <value>supergroup</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.replication.min</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.replication.max</name>
    <value>512</value>
  </property>
--------------------
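
The HA settings above look complete to me. To double-check which
values the DataNode host actually resolves, I would try something
like the following. This is an untested sketch; "hdfs getconf
-confKey" and "hdfs haadmin -getServiceState" are the commands I
would expect to be available in CDH 4.1.1, but I have not verified
them on this cluster.

--- config sanity checks (untested sketch) ---
# hdfs getconf -confKey dfs.nameservices
  (should print nameservice1)
# hdfs getconf -confKey dfs.ha.namenodes.nameservice1
  (should print namenode38,namenode90)
# sudo -u hdfs hdfs haadmin -getServiceState namenode38
# sudo -u hdfs hdfs haadmin -getServiceState namenode90
  (each should report active or standby)
--------------------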

The following is my attempt to create a file in HDFS, which failed.

--------------------
# vi /tmp/test.txt
# sudo -u hdfs hadoop fs -mkdir /takahiko
# sudo -u hdfs hadoop fs -ls /
Found 3 items
drwxr-xr-x   - hbase hbase               0 2012-10-30 15:12 /hbase
drwxr-xr-x   - hdfs  supergroup          0 2012-10-30 18:55 /takahiko
drwxrwxrwt   - hdfs  hdfs                0 2012-10-26 23:58 /tmp
# sudo -u hdfs hadoop fs -copyFromLocal /tmp/test.txt /takahiko/
12/10/30 20:07:05 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/takahiko/test.txt._COPYING_ could only be replicated to 0 nodes
instead of minReplication (=1).  There are 0 datanode(s) running and
no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1322)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2170)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)

        at org.apache.hadoop.ipc.Client.call(Client.java:1160)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at $Proxy9.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        at $Proxy10.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1150)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1003)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
copyFromLocal: File /takahiko/test.txt._COPYING_ could only be
replicated to 0 nodes instead of minReplication (=1).  There are 0
datanode(s) running and no node(s) are excluded in this operation.
12/10/30 20:07:05 ERROR hdfs.DFSClient: Failed to close file
/takahiko/test.txt._COPYING_
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/takahiko/test.txt._COPYING_ could only be replicated to 0 nodes
instead of minReplication (=1).  There are 0 datanode(s) running and
no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1322)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2170)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)

        at org.apache.hadoop.ipc.Client.call(Client.java:1160)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at $Proxy9.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        at $Proxy10.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1150)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1003)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
--------------------
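
This error itself seems consistent with the Cluster Summary above:
with zero live DataNodes, the NameNode has no target to place even a
single replica on, so the NullPointerException in the DataNode
heartbeat path looks like the root cause to me.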


Could anyone give me a hint on how to solve this problem?

Best Regards,
Takahiko Kawasaki
