From: Jakob Homan
Date: Mon, 10 Oct 2011 11:19:54 -0700
Subject: Re: Giraph will fail while using more workers
To: giraph-user@incubator.apache.org

Right now, Giraph doesn't scale much past 300 workers due to its
threading model. I'm almost done with a Thrift/Finagle version that
I've taken past 1k workers. The patch should be up in the next couple
of days.
-Jakob

On Mon, Oct 10, 2011 at 11:17 AM, Christian Kunz wrote:
> Did you try something like
> -Dmapred.child.java.opts="-Xss64k"?
> (see GIRAPH-12)
> Christian
> On Oct 10, 2011, at 11:08 AM, Zhiwei Gu wrote:
>
> Hi all,
>   In my Giraph job, when I set the number of workers to 200, it runs
> fine, but when I set it to 500, it fails due to an early-stage OOM
> exception in one (or more) workers. When such a worker fails, the other
> workers that want to talk to it keep waiting until they have retried 5
> times, and then they fail as well.
> Have you ever faced such an issue?
> Best,
> -z
>
> Here is the exception,
> 2011-10-08 09:26:59,108 INFO org.apache.giraph.comm.RPCCommunications: getRPCServer: Added jobToken Ident: 17 6a 6f 62 5f 32 30 31 31 30 38 32 36 30 39 31 31 5f 36 36 37 30 39 30, Pass: 12 26 1a f1 d2 51 e1 bf 2d 36 63 11 26 18 17 3d 53 b3 15 f6, Kind: mapreduce.job, Service: job_201108260911_667090
>
> 2011-10-08 09:26:59,116 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,116 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,120 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source RpcDetailedActivityForPort31250 registered.
> 2011-10-08 09:26:59,121 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source RpcActivityForPort31250 registered.
> 2011-10-08 09:26:59,123 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
> 2011-10-08 09:26:59,123 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 31250: starting
> 2011-10-08 09:26:59,127 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 31250: starting
> 2011-10-08 09:26:59,127 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 31250: starting
> 2011-10-08 09:26:59,133 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 31250: starting
> 2011-10-08 09:26:59,133 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 31250: starting
> 2011-10-08 09:26:59,137 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 11 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 12 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 13 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 14 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 15 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 16 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 17 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 18 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 19 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 20 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 21 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 22 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 24 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 25 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 26 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 27 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 28 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 29 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 30 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 31 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 32 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 33 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 34 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 35 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 36 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 37 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 38 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 39 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 40 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 41 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 42 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 43 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 44 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 45 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 46 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 47 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 48 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 49 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 50 on 31250: starting
> 2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 51 on 31250: starting
> 2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 52 on 31250: starting
> 2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 53 on 31250: starting
> 2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 54 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 55 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 56 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 57 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 58 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 59 on 31250: starting
> 2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 60 on 31250: starting
> 2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 61 on 31250: starting
> 2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 62 on 31250: starting
> 2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 63 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 64 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 65 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 66 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 67 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 68 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 69 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 70 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 71 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 72 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 73 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 74 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 75 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 76 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 77 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 78 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 79 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 80 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 81 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 82 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 83 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 84 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 85 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 86 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 88 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 89 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 90 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 91 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 92 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 93 on 31250: starting
> 2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 94 on 31250: starting
> 2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 95 on 31250: starting
> 2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 96 on 31250: starting
> 2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 97 on 31250: starting
> 2011-10-08 09:26:59,161 INFO org.apache.hadoop.ipc.Server: IPC Server handler 98 on 31250: starting
> 2011-10-08 09:26:59,161 INFO org.apache.giraph.comm.BasicRPCCommunications: BasicRPCCommunications: Started RPC communication server: gsta33033.tan.ygrid.yahoo.com/10.216.176.59:31250 with 100 handlers
> 2011-10-08 09:26:59,161 INFO org.apache.hadoop.ipc.Server: IPC Server handler 99 on 31250: starting
> 2011-10-08 09:27:05,234 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
> 2011-10-08 09:27:05,236 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:597)
>     at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
>     at org.apache.hadoop.util.Shell.run(Shell.java:182)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
>     at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
>     at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
>     at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
>     at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
>     at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
>     at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>     at org.apache.hadoop.mapred.Child.main(Child.java:255)
>
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source ugi(org.apache.hadoop.security.UgiInstrumentation)
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source jvm(org.apache.hadoop.metrics2.source.JvmMetricsSource)
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source RpcDetailedActivityForPort31250(org.apache.hadoop.ipc.metrics.RpcInstrumentation$Detailed)
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source RpcActivityForPort31250(org.apache.hadoop.ipc.metrics.RpcInstrumentation)
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
>
> --
> Best Regards
> Zhiwei Gu
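
For anyone reading this thread later, here is a rough back-of-the-envelope sketch of why the -Xss64k suggestion might help. It only multiplies a thread count by a per-thread stack size. The 100 handlers come from the log in this thread; everything else is an assumption made for illustration: the 1 MiB default stack size (common on 64-bit Linux HotSpot, but check your JVM), and especially the "one extra thread per peer worker" model, which the thread does not confirm. The class and method names are hypothetical.

```java
public class ThreadStackEstimate {

    /** Address space reserved for thread stacks, in KiB (threads x stack size). */
    public static long stackReservationKiB(int threads, int stackSizeKiB) {
        return (long) threads * stackSizeKiB;
    }

    public static void main(String[] args) {
        int handlers = 100;               // from the log: "with 100 handlers"
        int[] workerCounts = {200, 500};  // the two job sizes reported in the thread
        int[] stackSizesKiB = {1024, 64}; // ASSUMPTION: 1 MiB default vs. -Xss64k

        for (int workers : workerCounts) {
            // ASSUMPTION (not confirmed by the thread): one extra thread per peer worker.
            int threads = handlers + workers;
            for (int stackKiB : stackSizesKiB) {
                double mib = stackReservationKiB(threads, stackKiB) / 1024.0;
                System.out.printf(
                        "workers=%d threads=%d Xss=%dk -> ~%.1f MiB reserved for stacks%n",
                        workers, threads, stackKiB, mib);
            }
        }
    }
}
```

Note that "unable to create new native thread" can also be caused by an operating-system limit on processes or threads per user rather than by memory, in which case smaller stacks alone may not be enough.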