Date: Tue, 4 Mar 2014 20:23:09 +0530
Subject: Node manager or Resource Manager crash
From: Krishna Kishore Bonagiri
To: user@hadoop.apache.org

Hi,

I am running an application on a 2-node cluster. It tries to acquire all the containers that are available on one of the nodes, and the remaining containers from the other node in the cluster. When I run this application continuously in a loop, either the NM or the RM gets killed at a random point. There is no corresponding message in the log files.

On one of the occasions when the NM was killed today, the tail of its log looked like this:

2014-03-04 02:42:44,386 DEBUG org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: isredeng:52867 sending out status for 16 containers
2014-03-04 02:42:44,386 DEBUG org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's health-status : true,

At the time of the NM's crash, the RM's log has the following entries:

2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing isredeng:52867 of type STATUS_UPDATE
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType: NODE_UPDATE
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server Responder: responding to org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: nodeUpdate: isredeng:52867 clusterResources: <memory:16384, vCores:16>
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Node being looked for scheduling isredeng:52867 availableResource: <memory:0, vCores:-8>
2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server:  got #151

Note: the node on which the NM was killed is named isredeng. Does anything in the above messages indicate why it was killed?

Thanks,
Kishore