hadoop-common-user mailing list archives

From Krishna Kishore Bonagiri <write2kish...@gmail.com>
Subject Re: Node manager or Resource Manager crash
Date Wed, 05 Mar 2014 05:28:04 GMT
Yes Vinod, I asked this question some time back, and I have now come back
to resolve the issue.

I checked whether the OOM killer is killing the processes, but it is not. I
monitored the free swap space on my box while my test was running, and that
does not seem to be the issue. I also checked whether the OOM score of any
of these processes goes high, because that is when the OOM killer kills
them, but the scores are not going high either.
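
For reference, here is a minimal sketch of the kind of check described
above, assuming a Linux box where /proc/<pid>/oom_score and /proc/meminfo
are readable. The function names and the proc_root parameter are
illustrative only and are not part of Hadoop. Sampling can miss a short
spike, so the kernel log remains the more reliable place to confirm or rule
out an OOM kill.

```python
import os


def read_oom_score(pid, proc_root="/proc"):
    """Return the kernel's current OOM score for pid, or None if the
    process no longer exists."""
    path = os.path.join(proc_root, str(pid), "oom_score")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ProcessLookupError):
        # The process exited (or was killed) between samples.
        return None


def read_swap_free_kb(proc_root="/proc"):
    """Return the SwapFree value from meminfo in kB, or None if absent."""
    with open(os.path.join(proc_root, "meminfo")) as f:
        for line in f:
            if line.startswith("SwapFree:"):
                # Line looks like: "SwapFree:   2048 kB"
                return int(line.split()[1])
    return None


def snapshot(pids, proc_root="/proc"):
    """Collect one sample: free swap plus the OOM score of each pid."""
    return {
        "swap_free_kb": read_swap_free_kb(proc_root),
        "oom_scores": {pid: read_oom_score(pid, proc_root) for pid in pids},
    }
```

Calling snapshot() once a second with the NM and RM process ids while the
test loops, and logging the result with a timestamp, gives a record to
compare against the time of the crash.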

Thanks,
Kishore


On Tue, Mar 4, 2014 at 10:51 PM, Vinod Kumar Vavilapalli <vinodkv@apache.org> wrote:

> I remember you asking this question before. Check if your OS's OOM killer
> is killing it.
>
> +Vinod
>
> On Mar 4, 2014, at 6:53 AM, Krishna Kishore Bonagiri <
> write2kishore@gmail.com> wrote:
>
> Hi,
>   I am running an application on a 2-node cluster. It tries to acquire
> all the containers available on one of the nodes, and the remaining
> containers from the other node. When I run this application
> continuously in a loop, either the NM or the RM gets killed at a random
> point. There is no corresponding message in the log files.
>
> On one of the occasions when the NM got killed today, the tail of its log
> looked like this:
>
> 2014-03-04 02:42:44,386 DEBUG
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
> isredeng:52867 sending out status for 16 containers
> 2014-03-04 02:42:44,386 DEBUG
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's
> health-status : true,
>
>
> At the time of the NM's crash, the RM's log had the following entries:
>
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing
> isredeng:52867 of type STATUS_UPDATE
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType:
> NODE_UPDATE
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server
> Responder: responding to
> org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from
> 9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> nodeUpdate: isredeng:52867 clusterResources:
> <memory:16384, vCores:16>
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Node being looked for scheduling isredeng:52867
> availableResource: <memory:0, vCores:-8>
> 2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server:  got #151
>
>
> Note: the node on which the NM got killed is named isredeng. Do the above
> messages indicate anything about why it got killed?
>
> Thanks,
> Kishore