From: Devaraj k
To: user@hadoop.apache.org
Subject: RE: Node manager crashing when running an app requiring 100 containers on hadoop-2.1.0-beta RC0
Date: Thu, 25 Jul 2013 10:54:16 +0000

Hi Kishore,


It seems that the system doesn't have enough resources to launch a new thread.


Could you check whether the system can afford to launch the configured number of containers, and try increasing the native memory available by reducing the number of processes running on it?
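One way to follow up on this on a Linux NodeManager host is to look at the per-user process limit (RLIMIT_NPROC, shown by ulimit -u). Linux counts every thread against that limit, and EAGAIN (errno 11, as in the quoted log) is the error thread creation reports when that limit is reached or the system is otherwise short of resources. The Java sketch below is illustrative only: the class ThreadHeadroom and its method are hypothetical names, it assumes the Linux /proc/<pid>/limits and /proc/<pid>/task layout, and the thread count it prints covers a single process, whereas the limit applies to all processes of the same user, including container processes if they run as the NodeManager's user.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper: reports the per-user process/thread limit (RLIMIT_NPROC)
// and the current thread count of one process, using Linux /proc files.
public class ThreadHeadroom {

    // Returns the soft limit from the "Max processes" row of a /proc/<pid>/limits
    // listing, or -1 if that row is missing or the limit is "unlimited".
    static long maxProcessesSoftLimit(List<String> limitsLines) {
        final String label = "Max processes";
        for (String line : limitsLines) {
            if (line.startsWith(label)) {
                // Remaining columns: soft limit, hard limit, units.
                String[] cols = line.substring(label.length()).trim().split("\\s+");
                if (cols[0].isEmpty() || cols[0].equals("unlimited")) {
                    return -1;
                }
                return Long.parseLong(cols[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Optional first argument: PID to inspect (e.g. the NodeManager); defaults to this JVM.
        Path proc = Paths.get("/proc", args.length > 0 ? args[0] : "self");
        Path limits = proc.resolve("limits");
        if (!Files.isReadable(limits)) {
            System.out.println(limits + " is not readable (not Linux, or no permission?)");
            return;
        }
        long softLimit = maxProcessesSoftLimit(Files.readAllLines(limits));
        // One entry per thread under /proc/<pid>/task. This covers only the target
        // process; the per-user limit also counts threads of the user's other processes.
        long threads;
        try (Stream<Path> tasks = Files.list(proc.resolve("task"))) {
            threads = tasks.count();
        }
        System.out.println("Max processes (soft limit): "
                + (softLimit < 0 ? "unlimited" : Long.toString(softLimit)));
        System.out.println("Threads in " + proc + ": " + threads);
    }
}
```

Passing the NodeManager's PID as the argument shows the limit that process actually inherited, which may differ from an interactive shell's ulimit -u if the daemon was started from an init script.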


Thanks

Devaraj k


From: Krishna Kishore Bonagiri [mailto:write2kishore@gmail.com]
Sent: 25 July 2013 16:09
To: user@hadoop.apache.org
Subject: Node manager crashing when running an app requiring 100 containers on hadoop-2.1.0-beta RC0

 

Hi,

 

I am running an application against hadoop-2.1.0-beta RC, and my app requires 117 containers. All the containers were allocated, but while they were being started, the node manager went down at around the 99th container with the following error in its log. I could also reproduce this error by running a "sleep 200; date" command with the Distributed Shell example, in which case the error occurred at around the 66th container.

 

 

2013-07-25 06:07:17,743 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[process reaper,5,main] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: Failed to create a thread: retVal -1073741830, errno 11
        at java.lang.Thread.startImpl(Native Method)
        at java.lang.Thread.start(Thread.java:887)
        at java.lang.ProcessInputStream.<init>(UNIXProcess.java:472)
        at java.lang.UNIXProcess$1$1$1.run(UNIXProcess.java:157)
        at java.security.AccessController.doPrivileged(AccessController.java:202)
        at java.lang.UNIXProcess$1$1.run(UNIXProcess.java:137)
2013-07-25 06:07:17,745 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException

 

Thanks,

Kishore
