Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of write2kishore@gmail.com
 designates 209.85.216.43 as permitted sender)
MIME-Version: 1.0
Date: Tue, 4 Mar 2014 20:23:09 +0530
Message-ID: 
 <CAHg+sbP7PPhtMXiY6F95SBAqcUdr4=2NGmtucMmpRj3BD6obkg@mail.gmail.com>
Subject: Node manager or Resource Manager crash
From: Krishna Kishore Bonagiri <write2kishore@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=001a113a09e80c221a04f3c910e0

--001a113a09e80c221a04f3c910e0
Content-Type: text/plain; charset=ISO-8859-1

Hi,
  I am running an application on a 2-node cluster. It tries to acquire all
the containers available on one of the nodes and the remaining containers
from the other node. When I run this application continuously in a loop,
either the NM or the RM gets killed at a random point. There is no
corresponding message in the log files.

On one of the occasions when the NM got killed today, the tail of its log
looked like this:

2014-03-04 02:42:44,386 DEBUG org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: isredeng:52867 sending out status for 16 containers
2014-03-04 02:42:44,386 DEBUG org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's health-status : true,


At the time of the NM's crash, the RM's log has the following entries:

2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing isredeng:52867 of type STATUS_UPDATE
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType: NODE_UPDATE
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server Responder: responding to org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: nodeUpdate: isredeng:52867 clusterResources: <memory:16384, vCores:16>
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Node being looked for scheduling isredeng:52867 availableResource: <memory:0, vCores:-8>
2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server:  got #151


Note: the node on which the NM got killed is isredeng. Does anything in the
above messages indicate why it got killed?
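
The availableResource entry above shows vCores:-8 for isredeng. I do not
know whether that is related to the crash, but in case it is useful, here is
a minimal Python sketch for pulling such negative values out of an RM log.
The function name and the regular expression are my own assumptions based on
the excerpt above; they are not anything provided by Hadoop.

```python
import re

# Assumed pattern: matches the "availableResource: <memory:N, vCores:M>"
# fragment as it appears in the RM log excerpt above. \s* also tolerates a
# line break between the label and the bracketed values.
RESOURCE_RE = re.compile(
    r"availableResource:\s*<memory:(-?\d+),\s*vCores:(-?\d+)>"
)


def find_negative_resources(log_text):
    """Return (memory, vcores) pairs where either value is below zero."""
    hits = []
    for match in RESOURCE_RE.finditer(log_text):
        memory, vcores = int(match.group(1)), int(match.group(2))
        if memory < 0 or vcores < 0:
            hits.append((memory, vcores))
    return hits


# Sample text copied from the RM log entries quoted in this message.
sample = (
    "Node being looked for scheduling isredeng:52867 "
    "availableResource: <memory:0, vCores:-8>\n"
    "nodeUpdate: isredeng:52867 clusterResources: <memory:16384, vCores:16>\n"
)

print(find_negative_resources(sample))  # prints [(0, -8)]
```

Running the same function over the whole RM log would show whether the
negative vCores value appears only around the crashes or all the time.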

Thanks,
Kishore

--001a113a09e80c221a04f3c910e0--