From: Zhiwei Gu
To: Avery Ching
Cc: giraph-user@incubator.apache.org
Date: Mon, 10 Oct 2011 12:46:30 -0700
Subject: Re: Giraph will fail while using more workers

Thank you girapher, I'll try the latest version and report the result later.

2011/10/10 Avery Ching <aching@apache.org>

> Hi Zhiwei,
>
> The issue (known) is basically from here:
>
> 2011-10-08 09:27:05,236 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:597)
>     at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
>     at java.security.AccessController.doPrivileged(Native Method)
>
> It has been addressed in GIRAPH-12 (https://issues.apache.org/jira/browse/GIRAPH-12).
>
> <snip>
> Currently every worker will start up a thread to communicate with every
> other worker. Hadoop RPC is used for communication. For instance, if there
> are 400 workers, each worker will create 400 threads. This ends up using a
> lot of memory on the stack per worker, even with the option
>
> -Dmapred.child.java.opts="-Xss64k".
> </snip>
>
> It would be good if you could try the latest Apache Giraph instead of the
> older one at Yahoo!. You would then need to set GiraphJob.MSG_NUM_FLUSH_THREADS
> (giraph.msgNumFlushThreads) to a value that won't cause you to run out of
> stack space.
>
> Avery
>
> On 10/10/11 11:08 AM, Zhiwei Gu wrote:
>
> Hi all,
> In my Giraph job, when I set the number of workers to 200 it is OK, but
> when I set it to 500 it fails due to an early-stage OOM exception in one
> (or more) workers. As this worker fails, any other worker that wants to
> talk to it keeps waiting until it has tried 5 times, and then that worker
> fails too.
>
> Have you ever faced such an issue?
>
> Best,
> -z
>
> Here is the exception:
> 2011-10-08 09:26:59,108 INFO org.apache.giraph.comm.RPCCommunications: getRPCServer: Added jobToken Ident: 17 6a 6f 62 5f 32 30 31 31 30 38 32 36 30 39 31 31 5f 36 36 37 30 39 30, Pass: 12 26 1a f1 d2 51 e1 bf 2d 36 63 11 26 18 17 3d 53 b3 15 f6, Kind: mapreduce.job, Service: job_201108260911_667090
> 2011-10-08 09:26:59,116 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,116 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,120 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source RpcDetailedActivityForPort31250 registered.
> 2011-10-08 09:26:59,121 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source RpcActivityForPort31250 registered.
> 2011-10-08 09:26:59,123 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
> 2011-10-08 09:26:59,123 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 31250: starting
> 2011-10-08 09:26:59,127 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 31250: starting
> 2011-10-08 09:26:59,127 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 31250: starting
> 2011-10-08 09:26:59,133 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 31250: starting
> 2011-10-08 09:26:59,133 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 31250: starting
> 2011-10-08 09:26:59,137 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 11 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 12 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 13 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 14 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 15 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 16 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 17 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 18 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 19 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 20 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 21 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 22 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 24 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 25 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 26 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 27 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 28 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 29 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 30 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 31 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 32 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 33 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 34 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 35 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 36 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 37 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 38 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 39 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 40 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 41 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 42 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 43 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 44 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 45 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 46 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 47 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 48 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 49 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 50 on 31250: starting
> 2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 51 on 31250: starting
> 2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 52 on 31250: starting
> 2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 53 on 31250: starting
> 2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 54 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 55 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 56 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 57 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 58 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 59 on 31250: starting
> 2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 60 on 31250: starting
> 2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 61 on 31250: starting
> 2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 62 on 31250: starting
> 2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 63 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 64 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 65 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 66 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 67 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 68 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 69 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 70 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 71 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 72 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 73 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 74 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 75 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 76 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 77 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 78 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 79 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 80 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 81 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 82 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 83 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 84 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 85 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 86 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 88 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 89 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 90 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 91 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 92 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 93 on 31250: starting
> 2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 94 on 31250: starting
> 2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 95 on 31250: starting
> 2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 96 on 31250: starting
> 2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 97 on 31250: starting
> 2011-10-08 09:26:59,161 INFO org.apache.hadoop.ipc.Server: IPC Server handler 98 on 31250: starting
> 2011-10-08 09:26:59,161 INFO org.apache.giraph.comm.BasicRPCCommunications: BasicRPCCommunications: Started RPC communication server: gsta33033.tan.ygrid.yahoo.com/10.216.176.59:31250 with 100 handlers
> 2011-10-08 09:26:59,161 INFO org.apache.hadoop.ipc.Server: IPC Server handler 99 on 31250: starting
> 2011-10-08 09:27:05,234 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
> 2011-10-08 09:27:05,236 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:597)
>     at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
>     at org.apache.hadoop.util.Shell.run(Shell.java:182)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
>     at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
>     at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
>     at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
>     at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
>     at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
>     at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>     at org.apache.hadoop.mapred.Child.main(Child.java:255)
>
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source ugi(org.apache.hadoop.security.UgiInstrumentation)
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source jvm(org.apache.hadoop.metrics2.source.JvmMetricsSource)
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source RpcDetailedActivityForPort31250(org.apache.hadoop.ipc.metrics.RpcInstrumentation$Detailed)
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source RpcActivityForPort31250(org.apache.hadoop.ipc.metrics.RpcInstrumentation)
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
>
> --
> Best Regards
> Zhiwei Gu

--
Best Regards
Zhiwei Gu
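
For anyone who wants to check the thread arithmetic in Avery's explanation (one communication thread per peer worker, each with its own stack), here is a minimal Java sketch. The class and method names are made up for illustration and are not part of Giraph or Hadoop. The 1 MiB figure is only an assumed typical JVM default stack size, not something reported in this thread. The sketch just multiplies thread count by stack size, so it ignores the 100 IPC handler threads visible in the log and any OS limit on how many threads a process may create.

```java
/** Hypothetical helper for illustration only; not part of Giraph or Hadoop. */
final class ThreadStackEstimate {
    private static final long KIB = 1024L;

    /** Stack space reserved if each of `threads` threads gets `stackBytes` of stack. */
    static long reservedStackBytes(int threads, long stackBytes) {
        return threads * stackBytes;
    }

    public static void main(String[] args) {
        long xss64k = 64 * KIB;            // the -Xss64k option quoted in the thread
        long assumedDefault = 1024 * KIB;  // ASSUMPTION: 1 MiB default; varies by JVM and OS

        // Worker counts mentioned in the thread: 200 (works), 400 (example), 500 (fails).
        for (int workers : new int[] {200, 400, 500}) {
            // One communication thread per peer worker, as described in the quoted text.
            System.out.printf("%d workers: %d KiB at -Xss64k, %d KiB at 1 MiB stacks%n",
                    workers,
                    reservedStackBytes(workers, xss64k) / KIB,
                    reservedStackBytes(workers, assumedDefault) / KIB);
        }
    }
}
```

With these numbers, 400 threads at 64 KiB each come to 25 MiB of stack, and 400 threads at the assumed 1 MiB come to 400 MiB, per worker JVM.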

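
The advice to set giraph.msgNumFlushThreads amounts to flushing messages to all peers with a bounded number of threads instead of one thread per peer. The sketch below shows that general idea with a fixed-size pool from java.util.concurrent. It is not Giraph's code, and I am only assuming the setting bounds the flush threads in roughly this way. BoundedFlushSketch, flushAll and the counter that stands in for "send buffered messages to one peer" are all hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of a bounded flush pool; not Giraph's implementation. */
final class BoundedFlushSketch {

    /** Outcome of one flush round. */
    static final class Result {
        final int peersFlushed;  // how many peers had their (pretend) messages flushed
        final int threadsUsed;   // how many distinct pool threads did the work

        Result(int peersFlushed, int threadsUsed) {
            this.peersFlushed = peersFlushed;
            this.threadsUsed = threadsUsed;
        }
    }

    /** Flushes to numPeers peers using at most numFlushThreads threads. */
    static Result flushAll(int numPeers, int numFlushThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numFlushThreads);
        AtomicInteger flushed = new AtomicInteger();
        Set<String> threadNames = ConcurrentHashMap.newKeySet();

        for (int peer = 0; peer < numPeers; peer++) {
            pool.execute(() -> {
                threadNames.add(Thread.currentThread().getName());
                flushed.incrementAndGet(); // stand-in for sending buffered messages to one peer
            });
        }

        pool.shutdown(); // accept no new tasks; queued tasks still run
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("flush tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for flush", e);
        }
        return new Result(flushed.get(), threadNames.size());
    }

    public static void main(String[] args) {
        Result r = flushAll(400, 8);
        System.out.println(r.peersFlushed + " peers flushed by " + r.threadsUsed + " thread(s)");
    }
}
```

The point of the design is that the thread count, and so the stack space, stays fixed by the pool size no matter how many workers the job has. The cost is that flushes to different peers queue up behind each other.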