hadoop-common-user mailing list archives

From Rohith Sharma K S <rohithsharm...@huawei.com>
Subject RE: NodeManager health Question
Date Fri, 14 Mar 2014 02:55:30 GMT
Hi,

As troubleshooting steps, here are a few things you can verify:

1.     Check the RM web UI (http://< yarn.resourcemanager.webapp.address>/cluster) to see whether there are any "Active Nodes" in the YARN cluster.

And also check the "Lost Nodes", "Unhealthy Nodes", and "Rebooted Nodes" counts.
                 If there are any active nodes, then cross-verify the "Memory Total". This should
be: Memory Total = Number of Active Nodes * value of yarn.nodemanager.resource.memory-mb

2.     The NodeManager logs give more information; check the NM logs as well.
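As a concrete illustration of the "Memory Total" check in point 1 (the node count and per-NM memory below are made-up example values, not taken from your cluster):

```shell
# Hypothetical example: 4 active NodeManagers, each configured with
# yarn.nodemanager.resource.memory-mb = 8192 (both values assumed for illustration).
active_nodes=4
nm_memory_mb=8192

# The RM web UI's "Memory Total" should equal the product:
echo "Expected Memory Total: $((active_nodes * nm_memory_mb)) MB"
```

If the UI reports less than this, some NodeManagers are not registered with the RM; `yarn node -list` on the command line also shows each node's state.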

>>> In Yarn, my Hive queries are "Accepted" but are "Unassigned" and do not run
             This may be because your YARN cluster does not have enough memory to launch containers.
Possible reasons could be:

1.     None of the NMs are sending heartbeats to the RM (check the RM web UI for Unhealthy Nodes).

2.     All the NMs are lost/unhealthy.

3.     The full cluster capacity is in use, so the YARN scheduler is waiting for some container to finish before it can assign the released memory to other containers.
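A sketch of how to tell these three cases apart from the RM's cluster metrics (the RM REST endpoint /ws/v1/cluster/metrics reports counters like activeNodes, unhealthyNodes, allocatedMB, and totalMB; the values below are invented for illustration):

```shell
# Hypothetical counters, as /ws/v1/cluster/metrics on the RM web address
# would report them. These values are made up for illustration only.
active_nodes=0
unhealthy_nodes=3
allocated_mb=0
total_mb=0

if [ "$active_nodes" -eq 0 ]; then
  # Covers reasons 1 and 2: no NM is registered and healthy.
  echo "No active NodeManagers - check NM heartbeats/health"
elif [ "$allocated_mb" -ge "$total_mb" ]; then
  # Reason 3: every MB is allocated, so new apps wait in ACCEPTED.
  echo "Cluster memory fully allocated - apps queue until containers finish"
else
  echo "Capacity available - check queue limits and scheduler config"
fi
```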

        Looking at your DataNode socket timeout exception (and an 8-minute timeout at that!), I suspect
that the Hadoop cluster's network is unstable. It would be better to debug the network.
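For reference, the 480000 ms in the exception matches what I believe is the HDFS default for dfs.datanode.socket.write.timeout, meaning the DataNode waited the full default write window before the read failed:

```shell
# 480000 ms = 8 * 60 * 1000 ms, which (to my knowledge) is the default
# value of dfs.datanode.socket.write.timeout.
timeout_ms=480000
echo "Timeout: $((timeout_ms / 60000)) minutes"
```

Raising the timeout would only mask the symptom; a healthy network should not need 8 minutes to accept a write, which is why I suspect the network itself.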

Thanks & Regards
Rohith Sharma K S

From: Clay McDonald [mailto:stuart.mcdonald@bateswhite.com]
Sent: 14 March 2014 01:30
To: 'user@hadoop.apache.org'
Subject: NodeManager health Question

Hello all, I have laid out my POC in a project plan and have HDP 2.0 installed. HDFS is running
fine, and I have loaded about 6TB of data to run my tests on. I have a series of SQL queries
that I will run in Hive ver. 0.12.0. I had to manually install Hue and still have a few issues
I'm working on there, but at the moment my most pressing issue is with Hive jobs not running.
In Yarn, my Hive queries are "Accepted" but are "Unassigned" and do not run. See attached.

In Ambari, the datanodes all show the following error: NodeManager health CRIT for 20 days
CRITICAL: NodeManager unhealthy

From the datanode logs I found the following:

ERROR datanode.DataNode (DataXceiver.java:run(225)) - dc-bigdata1.bateswhite.com:50010:DataXceiver
error processing READ_BLOCK operation  src: / dest: /
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready
for write. ch : java.nio.channels.SocketChannel[connected local=/ remote=/]
            at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
            at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
            at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
            at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
            at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
            at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
            at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
            at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
            at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
            at java.lang.Thread.run(Thread.java:662)

Also, in the namenode log I see the following:

2014-03-13 13:50:57,204 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1355))
- No groups available for user dr.who

If anyone can point me in the right direction to troubleshoot this, I would really appreciate it.

Thanks! Clay
