Hi all,
  In my Giraph job, everything works when I set the number of workers to 200, but with 500 workers the job fails because one or more workers hit an OOM exception (java.lang.OutOfMemoryError: unable to create new native thread) at an early stage. Once such a worker has failed, the other workers that want to talk to it keep waiting until they have retried 5 times, and then they fail as well.

Has anyone run into this issue before?

Best,
-z


Here is the exception:
2011-10-08 09:26:59,108 INFO org.apache.giraph.comm.RPCCommunications: getRPCServer: Added jobToken Ident: 17 6a 6f 62 5f 32 30 31 31 30 38 32 36 30 39 31 31 5f 36 36 37 30 39 30, Pass: 12 26 1a f1 d2 51 e1 bf 2d 36 63 11 26 18 17 3d 53 b3 15 f6, Kind: mapreduce.job, Service: job_201108260911_667090
2011-10-08 09:26:59,116 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,116 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,120 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source RpcDetailedActivityForPort31250 registered.
2011-10-08 09:26:59,121 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source RpcActivityForPort31250 registered.
2011-10-08 09:26:59,123 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2011-10-08 09:26:59,123 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 31250: starting
2011-10-08 09:26:59,127 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 31250: starting
2011-10-08 09:26:59,127 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 31250: starting
2011-10-08 09:26:59,133 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 31250: starting
2011-10-08 09:26:59,133 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 31250: starting
2011-10-08 09:26:59,137 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 11 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 12 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 13 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 14 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 15 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 16 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 17 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 18 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 19 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 20 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 21 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 22 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 24 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 25 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 26 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 27 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 28 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 29 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 30 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 31 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 32 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 33 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 34 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 35 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 36 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 37 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 38 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 39 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 40 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 41 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 42 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 43 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 44 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 45 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 46 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 47 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 48 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 49 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 50 on 31250: starting
2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 51 on 31250: starting
2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 52 on 31250: starting
2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 53 on 31250: starting
2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 54 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 55 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 56 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 57 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 58 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 59 on 31250: starting
2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 60 on 31250: starting
2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 61 on 31250: starting
2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 62 on 31250: starting
2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 63 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 64 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 65 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 66 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 67 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 68 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 69 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 70 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 71 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 72 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 73 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 74 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 75 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 76 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 77 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 78 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 79 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 80 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 81 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 82 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 83 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 84 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 85 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 86 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 88 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 89 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 90 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 91 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 92 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 93 on 31250: starting
2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 94 on 31250: starting
2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 95 on 31250: starting
2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 96 on 31250: starting
2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 97 on 31250: starting
2011-10-08 09:26:59,161 INFO org.apache.hadoop.ipc.Server: IPC Server handler 98 on 31250: starting
2011-10-08 09:26:59,161 INFO org.apache.giraph.comm.BasicRPCCommunications: BasicRPCCommunications: Started RPC communication server: gsta33033.tan.ygrid.yahoo.com/10.216.176.59:31250 with 100 handlers
2011-10-08 09:26:59,161 INFO org.apache.hadoop.ipc.Server: IPC Server handler 99 on 31250: starting
2011-10-08 09:27:05,234 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
2011-10-08 09:27:05,236 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:597)
	at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
	at org.apache.hadoop.util.Shell.run(Shell.java:182)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
	at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
	at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
	at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
	at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
	at org.apache.hadoop.mapred.Child.main(Child.java:255)

2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source ugi(org.apache.hadoop.security.UgiInstrumentation)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source jvm(org.apache.hadoop.metrics2.source.JvmMetricsSource)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source RpcDetailedActivityForPort31250(org.apache.hadoop.ipc.metrics.RpcInstrumentation$Detailed)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source RpcActivityForPort31250(org.apache.hadoop.ipc.metrics.RpcInstrumentation)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
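
Since the error above is "unable to create new native thread" and the log shows the RPC server starting 100 handler threads, it may help to see how many threads a task JVM actually holds when it fails. The following is only a minimal sketch, not code from the job: the class name ThreadBudget and its methods are hypothetical, and it uses nothing beyond the standard java.lang.management API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper (not part of Giraph or Hadoop): reports how many
// threads the current JVM is holding, using the standard JMX thread bean.
public class ThreadBudget {

    // Number of live threads (daemon and non-daemon) in this JVM right now.
    static int liveThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    // Highest live thread count seen since the JVM started (or since the
    // peak was last reset).
    static int peakThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getPeakThreadCount();
    }

    public static void main(String[] args) {
        // Print both numbers; no particular values are expected here.
        System.out.println("live threads: " + liveThreads());
        System.out.println("peak threads: " + peakThreads());
    }
}
```

Calling something like ThreadBudget.liveThreads() from inside a worker, just after the RPC server has started, would show whether the per-task thread count is close to what the host allows. That is an assumption about where the limit is being hit, not something the log confirms.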

--
Best Regards
Zhiwei Gu