giraph-dev mailing list archives

From Arghya Kusum Das <arghyakusumdas2...@gmail.com>
Subject Giraph job is failing on 128 node cluster. Seems only one worker failure is causing the entire job failure
Date Sun, 16 Nov 2014 05:53:51 GMT
Hi,

My Giraph job runs fine on a smaller number of nodes.
But when I try to run it on a 128-node cluster, I get the following
error.
It seems that the failure of a single worker is causing the entire job to fail.
I have included the error messages from the master log and the failed worker's log below.
Any help is appreciated.


[MASTER LOG]
2014-11-15 23:01:45,305 INFO org.apache.giraph.worker.BspServiceWorker: finishSuperstep: (waiting for rest of workers) ALL_EXCEPT_ZOOKEEPER - Attempt=0, Superstep=59
2014-11-15 23:01:46,169 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, msg = unable to create new native thread, exiting...
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:691)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:943)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1336)
at java.lang.UNIXProcess.initStreams(UNIXProcess.java:172)
at java.lang.UNIXProcess$2.run(UNIXProcess.java:145)
at java.lang.UNIXProcess$2.run(UNIXProcess.java:143)
at java.security.AccessController.doPrivileged(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:143)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
at java.lang.Runtime.exec(Runtime.java:615)
at java.lang.Runtime.exec(Runtime.java:448)
at java.lang.Runtime.exec(Runtime.java:345)
at pga.MasterVertex.compute(MasterVertex.java:242)
at org.apache.giraph.master.BspServiceMaster.doMasterCompute(BspServiceMaster.java:1691)
at org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1627)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:115)

[FAILED WORKER LOG]
2014-11-15 23:11:46,281 WARN org.apache.giraph.comm.netty.NettyServer: start: Likely failed to bind on attempt 0 to port 30007
org.jboss.netty.channel.ChannelException: Failed to bind to: qb114/208.100.93.114:30007
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:298)
at org.apache.giraph.comm.netty.NettyServer.start(NettyServer.java:326)
at org.apache.giraph.comm.netty.NettyMasterServer.<init>(NettyMasterServer.java:49)
at org.apache.giraph.master.BspServiceMaster.becomeMaster(BspServiceMaster.java:877)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:98)
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:344)
at sun.nio.ch.Net.bind(Net.java:336)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:199)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.bind(NioServerSocketPipelineSink.java:138)
at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.handleServerSocket(NioServerSocketPipelineSink.java:90)
at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:64)
at org.jboss.netty.channel.Channels.bind(Channels.java:569)
at org.jboss.netty.channel.AbstractChannel.bind(AbstractChannel.java:187)
at org.jboss.netty.bootstrap.ServerBootstrap$Binder.channelOpen(ServerBootstrap.java:343)
at org.jboss.netty.channel.Channels.fireChannelOpen(Channels.java:170)
at org.jboss.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:80)
at org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory.newChannel(NioServerSocketChannelFactory.java:158)
at org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory.newChannel(NioServerSocketChannelFactory.java:86)
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:277)
... 4 more
2014-11-15 23:11:46,305 INFO org.apache.giraph.comm.netty.NettyServer: start: Started server communication server: qb114/208.100.93.114:31007 with up to 16 threads on bind attempt 1 with sendBufferSize = 32768 receiveBufferSize = 524288 backlog = 874
2014-11-15 23:11:46,325 INFO org.apache.giraph.comm.netty.NettyClient: NettyClient: Using execution handler with 8 threads after requestEncoder.
2014-11-15 23:11:46,325 INFO org.apache.giraph.master.BspServiceMaster: becomeMaster: I am now the master!
2014-11-15 23:11:46,326 INFO org.apache.giraph.master.BspServiceMaster: /_hadoopBsp/job_201411152123_0003/_vertexInputSplitDir already exists, no need to create
2014-11-15 23:11:46,326 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with NullPointerException
java.lang.NullPointerException
at java.lang.String.<init>(String.java:505)
at org.apache.giraph.master.BspServiceMaster.createInputSplits(BspServiceMaster.java:600)
at org.apache.giraph.master.BspServiceMaster.createVertexInputSplits(BspServiceMaster.java:696)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:100)
2014-11-15 23:11:46,327 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, msg = java.lang.NullPointerException, exiting...
java.lang.IllegalStateException: java.lang.NullPointerException
at org.apache.giraph.master.MasterThread.run(MasterThread.java:185)
Caused by: java.lang.NullPointerException
at java.lang.String.<init>(String.java:505)
at org.apache.giraph.master.BspServiceMaster.createInputSplits(BspServiceMaster.java:600)
at org.apache.giraph.master.BspServiceMaster.createVertexInputSplits(BspServiceMaster.java:696)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:100)


-- 
Thanks and regards,
Arghya Kusum Das
(225-362-4031)
