hadoop-user mailing list archives

From Krishna Kishore Bonagiri <write2kish...@gmail.com>
Subject Node manager or Resource Manager crash
Date Tue, 04 Mar 2014 14:53:09 GMT
Hi,
  I am running an application on a 2-node cluster. It tries to acquire
all the containers available on one of the nodes and the remaining
containers from the other node. When I run this application continuously
in a loop, either the NM or the RM gets killed at a random point. There
is no corresponding message in the log files.

On one of the occasions when the NM got killed today, the tail of its log
looked like this:

2014-03-04 02:42:44,386 DEBUG
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
isredeng:52867 sending out status for 16 containers
2014-03-04 02:42:44,386 DEBUG
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's
health-status : true,


At the time of the NM's crash, the RM's log had the following entries:

2014-03-04 02:42:40,371 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing
isredeng:52867 of type STATUS_UPDATE
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher:
Dispatching the event
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType:
NODE_UPDATE
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server
Responder: responding to
org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from
9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
2014-03-04 02:42:40,371 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
nodeUpdate: isredeng:52867 clusterResources:
<memory:16384, vCores:16>
2014-03-04 02:42:40,371 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Node being looked for scheduling isredeng:52867
availableResource: <memory:0, vCores:-8>
2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server:  got #151


Note: the node on which the NM got killed is named isredeng. Does anything
in the above messages indicate why it got killed?
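
One detail that stands out in the RM excerpt is the negative vCores value in
availableResource. Below is a minimal sketch, assuming the DEBUG line format
quoted above, of how the RM log could be scanned for such entries. The regex
and the function name find_negative_resources are illustrative assumptions,
not part of YARN, and a negative value is not confirmed as the cause of the
crash.

```python
import re

# Pattern inferred from the DEBUG lines quoted in this message;
# it is an assumption, not an official YARN log format.
AVAILABLE = re.compile(
    r"Node being looked for scheduling\s+(\S+)\s+"
    r"availableResource:\s*<memory:(-?\d+),\s*vCores:(-?\d+)>"
)


def find_negative_resources(log_text):
    """Return (node, memory, vcores) tuples where either value is negative."""
    hits = []
    for match in AVAILABLE.finditer(log_text):
        node = match.group(1)
        memory = int(match.group(2))
        vcores = int(match.group(3))
        if memory < 0 or vcores < 0:
            hits.append((node, memory, vcores))
    return hits


# Sample taken from the RM log excerpt above, joined onto one line.
sample = (
    "2014-03-04 02:42:40,371 DEBUG "
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity."
    "CapacityScheduler: "
    "Node being looked for scheduling isredeng:52867 "
    "availableResource: <memory:0, vCores:-8>"
)
print(find_negative_resources(sample))
```

To run it against a whole log file, read the file into a string and pass it
to find_negative_resources in place of the sample.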

Thanks,
Kishore
