giraph-user mailing list archives

From Denis Dudinski <denis.dudin...@gmail.com>
Subject Re: Out of core computation fails with KryoException: Buffer underflow
Date Tue, 08 Nov 2016 14:48:36 GMT
Hi Hai,

I think we saw something like this in our environment.

The interesting line is this one:
2016-10-27 19:04:00,000 INFO [SessionTracker]
org.apache.zookeeper.server.ZooKeeperServer: Expiring session
0x158084f5b2100b8, timeout of 600000ms exceeded

I think that one of the workers, for some reason, did not communicate
with ZooKeeper for quite a long time (it may be heavy network load or
high CPU consumption; your monitoring infrastructure should give you a
hint). The ZooKeeper session then expires and all ephemeral nodes for
that worker in the ZooKeeper tree are deleted. The master thinks the
worker is dead and halts the computation.
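
If you need to scan a long ZooKeeper server log for these expiries, a
throwaway script helps. This is just an illustrative sketch (the regex
and the helper name are mine, not part of any Giraph or ZooKeeper API):

```python
import re

# Matches ZooKeeper SessionTracker expiry lines such as:
#   "... ZooKeeperServer: Expiring session 0x158084f5b2100b8,
#    timeout of 600000ms exceeded"
EXPIRY_RE = re.compile(
    r"Expiring session (0x[0-9a-f]+), timeout of (\d+)ms exceeded")

def find_expired_sessions(log_lines):
    """Return (session_id, timeout_ms) for every expiry line found."""
    hits = []
    for line in log_lines:
        m = EXPIRY_RE.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2))))
    return hits

log = [
    "2016-10-27 19:04:00,000 INFO [SessionTracker] "
    "org.apache.zookeeper.server.ZooKeeperServer: Expiring session "
    "0x158084f5b2100b8, timeout of 600000ms exceeded",
]
print(find_expired_sessions(log))  # [('0x158084f5b2100b8', 600000)]
```

Every session id this prints corresponds to a worker whose ephemeral
nodes were dropped, which is where the master starts seeing workers as
dead.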

Your ZooKeeper session timeout is 600000 ms, which is 10 minutes. We
set this value much higher, to 1 hour, and were then able to perform
our computations successfully.
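
For reference, the timeout can be raised with a custom argument on the
launch command; I am quoting the property name from memory, so please
double-check it against GiraphConstants in your Giraph build:

```
-ca giraph.zkSessionMsecTimeout=3600000
```

(3600000 ms = 1 hour.)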

I hope it will help in your case too.

Best Regards,
Denis Dudinski

2016-11-08 16:43 GMT+03:00 Hai Lan <lanhai1988@gmail.com>:
> Hi Guys
>
> The OutOfMemoryError might be solved by adding
> "-Dmapreduce.map.memory.mb=14848". But in my tests I found some more
> problems while running the out-of-core graph.
>
> I did two tests with a 150G, 10^10-vertex input on version 1.2, and it
> seems it is not necessary to add something like
> "giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1", because
> it is adaptive. However, if I run without setting userPartitionCount and
> maxPartitionsInMemory, it keeps running on superstep -1 forever. None of
> the workers can finish superstep -1, and I can see a warning in the
> ZooKeeper log, though I am not sure whether it is the problem:
>
> WARN [netty-client-worker-3]
> org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught:
> Channel failed with remote address
> trantor21.umiacs.umd.edu/192.168.74.221:30172
> java.lang.ArrayIndexOutOfBoundsException: 1075052544
> 	at
> org.apache.giraph.comm.flow_control.NoOpFlowControl.getAckSignalFlag(NoOpFlowControl.java:52)
> 	at
> org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:796)
> 	at
> org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87)
> 	at
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> 	at
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> 	at
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
> 	at
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> 	at
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> 	at
> org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:74)
> 	at
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> 	at
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> 	at
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
> 	at
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126)
> 	at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
> 	at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
> 	at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
> 	at java.lang.Thread.run(Thread.java:745)
>
>
>
> If I add giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1,
> the whole command is:
>
> hadoop jar
> /home/hlan/giraph-1.2.0-hadoop2/giraph-examples/target/giraph-examples-1.2.0-hadoop2-for-hadoop-2.6.0-jar-with-dependencies.jar
> org.apache.giraph.GiraphRunner -Dgiraph.useOutOfCoreGraph=true
> -Ddigraph.block_factory_configurators=org.apache.giraph.conf.FacebookConfiguration
> -Dmapreduce.map.memory.mb=14848 org.apache.giraph.examples.myTask -vif
> org.apache.giraph.examples.LongFloatNullTextInputFormat -vip
> /user/hlan/cube/tmp/out/ -vof
> org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
> /user/hlan/output -w 199 -ca
> mapred.job.tracker=localhost:5431,steps=6,giraph.isStaticGraph=true,giraph.numInputThreads=10,giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1
>
> the job passes superstep -1 very quickly (around 10 mins), but it gets
> killed near the end of superstep 0.
>
> 2016-10-27 18:53:56,607 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Vertices
> - Mean: 9810049, Min: Worker(hostname=trantor11.umiacs.umd.edu
> hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) - 9771533, Max:
> Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu,
> MRtaskID=49, port=30049) - 9995724
> 2016-10-27 18:53:56,608 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Edges -
> Mean: 0, Min: Worker(hostname=trantor11.umiacs.umd.edu
> hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) - 0, Max:
> Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu,
> MRtaskID=49, port=30049) - 0
> 2016-10-27 18:53:56,634 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 18:54:26,638 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 18:54:56,640 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 18:55:26,641 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 18:55:56,642 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 18:56:26,643 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 18:56:56,644 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 18:57:26,645 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 18:57:56,646 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 18:58:26,647 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 18:58:56,675 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 18:59:26,676 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 18:59:56,677 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 19:00:26,678 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 19:00:56,679 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 19:01:26,680 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 19:01:29,610 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x158084f5b2100c6, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> at java.lang.Thread.run(Thread.java:745)
> 2016-10-27 19:01:29,612 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
> client /192.168.74.212:53136 which had sessionid 0x158084f5b2100c6
> 2016-10-27 19:01:31,702 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection
> from /192.168.74.212:56696
> 2016-10-27 19:01:31,711 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew
> session 0x158084f5b2100c6 at /192.168.74.212:56696
> 2016-10-27 19:01:31,712 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Established session
> 0x158084f5b2100c6 with negotiated timeout 600000 for client
> /192.168.74.212:56696
> 2016-10-27 19:01:56,681 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 19:02:20,029 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x158084f5b2100c5, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> at java.lang.Thread.run(Thread.java:745)
> 2016-10-27 19:02:20,030 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
> client /192.168.74.212:53134 which had sessionid 0x158084f5b2100c5
> 2016-10-27 19:02:21,584 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection
> from /192.168.74.212:56718
> 2016-10-27 19:02:21,608 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew
> session 0x158084f5b2100c5 at /192.168.74.212:56718
> 2016-10-27 19:02:21,608 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Established session
> 0x158084f5b2100c5 with negotiated timeout 600000 for client
> /192.168.74.212:56718
> 2016-10-27 19:02:26,682 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 19:02:56,683 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 19:03:05,743 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x158084f5b2100b9, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> at java.lang.Thread.run(Thread.java:745)
> 2016-10-27 19:03:05,744 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
> client /192.168.74.203:51130 which had sessionid 0x158084f5b2100b9
> 2016-10-27 19:03:07,452 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection
> from /192.168.74.203:54676
> 2016-10-27 19:03:07,493 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew
> session 0x158084f5b2100b9 at /192.168.74.203:54676
> 2016-10-27 19:03:07,494 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Established session
> 0x158084f5b2100b9 with negotiated timeout 600000 for client
> /192.168.74.203:54676
> 2016-10-27 19:03:26,684 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 19:03:53,712 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x158084f5b2100be, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> at java.lang.Thread.run(Thread.java:745)
> 2016-10-27 19:03:53,713 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
> client /192.168.74.203:51146 which had sessionid 0x158084f5b2100be
> 2016-10-27 19:03:55,436 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection
> from /192.168.74.203:54694
> 2016-10-27 19:03:55,482 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew
> session 0x158084f5b2100be at /192.168.74.203:54694
> 2016-10-27 19:03:55,483 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Established session
> 0x158084f5b2100be with negotiated timeout 600000 for client
> /192.168.74.203:54694
> 2016-10-27 19:03:56,719 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> 2016-10-27 19:04:00,000 INFO [SessionTracker]
> org.apache.zookeeper.server.ZooKeeperServer: Expiring session
> 0x158084f5b2100b8, timeout of 600000ms exceeded
> 2016-10-27 19:04:00,001 INFO [SessionTracker]
> org.apache.zookeeper.server.ZooKeeperServer: Expiring session
> 0x158084f5b2100c2, timeout of 600000ms exceeded
> 2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):]
> org.apache.zookeeper.server.PrepRequestProcessor: Processed session
> termination for sessionid: 0x158084f5b2100b8
> 2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):]
> org.apache.zookeeper.server.PrepRequestProcessor: Processed session
> termination for sessionid: 0x158084f5b2100c2
> 2016-10-27 19:04:00,004 INFO [SyncThread:0]
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
> client /192.168.74.203:51116 which had sessionid 0x158084f5b2100b8
> 2016-10-27 19:04:00,006 INFO [SyncThread:0]
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
> client /192.168.74.212:53128 which had sessionid 0x158084f5b2100c2
> 2016-10-27 19:04:00,033 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: setJobState:
> {"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1} on
> superstep 0
>
> Any idea about this?
>
> Thanks,
>
> Hai
>
>
> On Tue, Nov 8, 2016 at 6:37 AM, Denis Dudinski <denis.dudinski@gmail.com>
> wrote:
>>
>> Hi Xenia,
>>
>> Thank you! I'll check the thread you mentioned.
>>
>> Best Regards,
>> Denis Dudinski
>>
>> 2016-11-08 14:16 GMT+03:00 Xenia Demetriou <xeniad20@gmail.com>:
>> > Hi Denis,
>> >
>> > For the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error,
>> > I hope the conversation at the link below can help you:
>> > www.mail-archive.com/user@giraph.apache.org/msg02938.html
>> >
>> > Regards,
>> > Xenia
>> >
>> > 2016-11-08 12:25 GMT+02:00 Denis Dudinski <denis.dudinski@gmail.com>:
>> >>
>> >> Hi Hassan,
>> >>
>> >> Thank you for really quick response!
>> >>
>> >> I changed "giraph.isStaticGraph" to false and the error disappeared.
>> >> As expected, the iteration became slower and wrote the edges to disk
>> >> once again in superstep 1.
>> >>
>> >> However, the computation failed at superstep 2 with
>> >> "java.lang.OutOfMemoryError: GC overhead limit exceeded". It seems to
>> >> be unrelated to the "isStaticGraph" issue, but I think it is worth
>> >> mentioning to see the picture as a whole.
>> >>
>> >> Are there any other tests I can run or information I can check to
>> >> help pinpoint the "isStaticGraph" problem?
>> >>
>> >> Best Regards,
>> >> Denis Dudinski
>> >>
>> >>
>> >> 2016-11-07 20:00 GMT+03:00 Hassan Eslami <hsn.eslami@gmail.com>:
>> >> > Hi Denis,
>> >> >
>> >> > Thanks for bringing up the issue. In a previous conversation thread,
>> >> > a similar problem was reported even with a simpler example, connected
>> >> > component calculation. Back then, though, we were still developing
>> >> > other performance-critical components of OOC.
>> >> >
>> >> > Let's debug this issue together to make the new OOC more stable. I
>> >> > suspect the problem is with "giraph.isStaticGraph=true" (it is only
>> >> > an optimization, and most of our end-to-end testing was on cases
>> >> > where the graph could change). Let's get rid of it for now and see
>> >> > whether the problem still exists.
>> >> >
>> >> > Best,
>> >> > Hassan
>> >> >
>> >> > On Mon, Nov 7, 2016 at 6:24 AM, Denis Dudinski
>> >> > <denis.dudinski@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Hello,
>> >> >>
>> >> >> We are trying to calculate PageRank on a huge graph which does not
>> >> >> fit into memory. For the calculation to succeed we tried to turn on
>> >> >> the OutOfCore feature of Giraph, but every launch we tried resulted
>> >> >> in com.esotericsoftware.kryo.KryoException: Buffer underflow.
>> >> >> Each time it happens on different servers, but always right after
>> >> >> the start of superstep 1.
>> >> >>
>> >> >> We are using Giraph 1.2.0 on Hadoop 2.7.3 (our production version;
>> >> >> we can't step back to Giraph's officially supported version and had
>> >> >> to patch Giraph a little) on 11 servers + 3 master servers
>> >> >> (namenodes etc.) with a separate ZooKeeper cluster deployment.
>> >> >>
>> >> >> Our launch command:
>> >> >>
>> >> >> hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar
>> >> >> org.apache.giraph.GiraphRunner
>> >> >> com.prototype.di.pr.PageRankComputation \
>> >> >> -mc com.prototype.di.pr.PageRankMasterCompute \
>> >> >> -yj pr-job-jar-with-dependencies.jar \
>> >> >> -vif com.belprime.di.pr.input.HBLongVertexInputFormat \
>> >> >> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
>> >> >> -op /user/hadoop/output/pr_test \
>> >> >> -w 10 \
>> >> >> -c com.prototype.di.pr.PRDoubleCombiner \
>> >> >> -wc com.prototype.di.pr.PageRankWorkerContext \
>> >> >> -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
>> >> >> -ca giraph.logLevel=info \
>> >> >> -ca hbase.mapreduce.inputtable=di_test \
>> >> >> -ca hbase.mapreduce.scan.columns=di:n \
>> >> >> -ca hbase.defaults.for.version.skip=true \
>> >> >> -ca hbase.table.row.textkey=false \
>> >> >> -ca giraph.yarn.task.heap.mb=48000 \
>> >> >> -ca giraph.isStaticGraph=true \
>> >> >> -ca giraph.SplitMasterWorker=false \
>> >> >> -ca giraph.oneToAllMsgSending=true \
>> >> >> -ca giraph.metrics.enable=true \
>> >> >> -ca giraph.jmap.histo.enable=true \
>> >> >> -ca
>> >> >> giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable \
>> >> >> -ca
>> >> >> giraph.outgoingMessageValueClass=org.apache.hadoop.io.DoubleWritable \
>> >> >> -ca
>> >> >> giraph.inputOutEdgesClass=org.apache.giraph.edge.LongNullArrayEdges \
>> >> >> -ca giraph.useOutOfCoreGraph=true \
>> >> >> -ca giraph.waitForPerWorkerRequests=true \
>> >> >> -ca giraph.maxNumberOfUnsentRequests=1000 \
>> >> >> -ca
>> >> >> giraph.vertexInputFilterClass=com.prototype.di.pr.PagesFromSameDomainLimiter \
>> >> >> -ca giraph.useInputSplitLocality=true \
>> >> >> -ca hbase.mapreduce.scan.cachedrows=10000 \
>> >> >> -ca giraph.minPartitionsPerComputeThread=60 \
>> >> >> -ca
>> >> >> giraph.graphPartitionerFactoryClass=com.prototype.di.pr.DomainAwareGraphPartitionerFactory \
>> >> >> -ca giraph.numInputThreads=1 \
>> >> >> -ca giraph.inputSplitSamplePercent=20 \
>> >> >> -ca giraph.pr.maxNeighborsPerVertex=50 \
>> >> >> -ca
>> >> >> giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition \
>> >> >> -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \
>> >> >> -ca
>> >> >>
>> >> >>
>> >> >> giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions
>> >> >>
>> >> >> Logs excerpt:
>> >> >>
>> >> >> 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in
>> >> >> worker context
>> >> >> 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60
>> >> >> partitions
>> >> >> to process with 1 compute thread(s), originally 1 thread(s) on
>> >> >> superstep 1
>> >> >> 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60
>> >> >> partitions in memory and 1 active threads
>> >> >> 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR
>> >> >> computation
>> >> >> 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle:
>> >> >> getNextIOActions:
>> >> >> usedMemoryFraction = 0.75
>> >> >> 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine:
>> >> >> updateActiveThreadsFraction: updating the number of active threads
>> >> >> to
>> >> >> 1
>> >> >> 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle:
>> >> >> updateRequestsCredit: updating the credit to 20
>> >> >> 16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring:
>> >> >> name = PS Scavenge, action = end of minor GC, cause = Allocation
>> >> >> Failure, duration = 937ms
>> >> >> 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle:
>> >> >> getNextIOActions:
>> >> >> usedMemoryFraction = 0.72
>> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle:
>> >> >> getNextIOActions:
>> >> >> usedMemoryFraction = 0.74
>> >> >> 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine:
>> >> >> updateActiveThreadsFraction: updating the number of active threads
>> >> >> to
>> >> >> 1
>> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle:
>> >> >> updateRequestsCredit: updating the credit to 20
>> >> >> 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle:
>> >> >> getNextIOActions:
>> >> >> usedMemoryFraction = 0.76
>> >> >> 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition:
>> >> >> processing partition 234 is done!
>> >> >> 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle:
>> >> >> getNextIOActions:
>> >> >> usedMemoryFraction = 0.79
>> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine:
>> >> >> updateActiveThreadsFraction: updating the number of active threads
>> >> >> to
>> >> >> 1
>> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle:
>> >> >> updateRequestsCredit: updating the credit to 18
>> >> >> 16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window
>> >> >> metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave
>> >> >> received req MBytes = 0.0138, secs waited = 30.058
>> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle:
>> >> >> getNextIOActions:
>> >> >> usedMemoryFraction = 0.82
>> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's
>> >> >> next
>> >> >> IO command is: StorePartitionIOCommand: (partitionId = 234)
>> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's
>> >> >> command StorePartitionIOCommand: (partitionId = 234) completed:
>> >> >> bytes=
>> >> >> 64419740, duration=351, bandwidth=175.03, bandwidth (excluding GC
>> >> >> time)=175.03
>> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle:
>> >> >> getNextIOActions:
>> >> >> usedMemoryFraction = 0.83
>> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's
>> >> >> next
>> >> >> IO command is: StoreIncomingMessageIOCommand: (partitionId = 234)
>> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's
>> >> >> command StoreIncomingMessageIOCommand: (partitionId = 234)
>> >> >> completed:
>> >> >> bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC
>> >> >> time)=NaN
>> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle:
>> >> >> getNextIOActions:
>> >> >> usedMemoryFraction = 0.83
>> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring:
>> >> >> name = PS Scavenge, action = end of minor GC, cause = Allocation
>> >> >> Failure, duration = 3107ms
>> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring:
>> >> >> name = PS MarkSweep, action = end of major GC, cause = Ergonomics,
>> >> >> duration = 15064ms
>> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine:
>> >> >> updateActiveThreadsFraction: updating the number of active threads
>> >> >> to
>> >> >> 1
>> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle:
>> >> >> updateRequestsCredit: updating the credit to 20
>> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle:
>> >> >> getNextIOActions:
>> >> >> usedMemoryFraction = 0.71
>> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's
>> >> >> next
>> >> >> IO command is: LoadPartitionIOCommand: (partitionId = 234,
>> >> >> superstep = 2)
>> >> >> JMap histo dump at Sun Nov 06 15:47:41 CET 2016
>> >> >> 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition:
>> >> >> processing partition 364 is done!
>> >> >> 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine:
>> >> >> updateActiveThreadsFraction: updating the number of active threads
>> >> >> to
>> >> >> 1
>> >> >> 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle:
>> >> >> updateRequestsCredit: updating the credit to 20
>> >> >> --
>> >> >> -- num     #instances         #bytes  class name
>> >> >> -- ----------------------------------------------
>> >> >> --   1:     224004229    10752202992
>> >> >> java.util.concurrent.ConcurrentHashMap$Node
>> >> >> --   2:      19751666     6645730528  [B
>> >> >> --   3:     222135985     5331263640
>> >> >> com.belprime.di.pr.DomainPartAwareLongWritable
>> >> >> --   4:     214686483     5152475592
>> >> >> org.apache.hadoop.io.DoubleWritable
>> >> >> --   5:           353     4357261784
>> >> >> [Ljava.util.concurrent.ConcurrentHashMap$Node;
>> >> >> --   6:        486266      204484688  [I
>> >> >> --   7:       6017652      192564864
>> >> >> org.apache.giraph.utils.UnsafeByteArrayOutputStream
>> >> >> --   8:       3986203      159448120
>> >> >> org.apache.giraph.utils.UnsafeByteArrayInputStream
>> >> >> --   9:       2064182      148621104
>> >> >> org.apache.giraph.graph.ByteValueVertex
>> >> >> --  10:       2064182       82567280
>> >> >> org.apache.giraph.edge.ByteArrayEdges
>> >> >> --  11:       1886875       45285000  java.lang.Integer
>> >> >> --  12:        349409       30747992
>> >> >> java.util.concurrent.ConcurrentHashMap$TreeNode
>> >> >> --  13:        916970       29343040  java.util.Collections$1
>> >> >> --  14:        916971       22007304
>> >> >> java.util.Collections$SingletonSet
>> >> >> --  15:         47270        3781600
>> >> >> java.util.concurrent.ConcurrentHashMap$TreeBin
>> >> >> --  16:         26201        2590912  [C
>> >> >> --  17:         34175        1367000
>> >> >> org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
>> >> >> --  18:          6143        1067704  java.lang.Class
>> >> >> --  19:         25953         830496  java.lang.String
>> >> >> --  20:         34175         820200
>> >> >> org.apache.giraph.edge.EdgeNoValue
>> >> >> --  21:          4488         703400  [Ljava.lang.Object;
>> >> >> --  22:            70         395424
>> >> >> [Ljava.nio.channels.SelectionKey;
>> >> >> --  23:          2052         328320  java.lang.reflect.Method
>> >> >> --  24:          6600         316800
>> >> >> org.apache.giraph.utils.ByteArrayVertexIdMessages
>> >> >> --  25:          5781         277488  java.util.HashMap$Node
>> >> >> --  26:          5651         271248  java.util.Hashtable$Entry
>> >> >> --  27:          6604         211328
>> >> >> org.apache.giraph.factories.DefaultMessageValueFactory
>> >> >> 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of
>> >> >> callable failed
>> >> >> java.lang.RuntimeException: call: execution of IO command
>> >> >> LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
>> >> >> at
>> >> >>
>> >> >>
>> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
>> >> >> at
>> >> >>
>> >> >>
>> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
>> >> >> at
>> >> >>
>> >> >>
>> >> >> org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
>> >> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> >> >> at
>> >> >>
>> >> >>
>> >> >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> >> >> at
>> >> >>
>> >> >>
>> >> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> >> >> at java.lang.Thread.run(Thread.java:745)
>> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer
>> >> >> underflow.
>> >> >> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
>> >> >> at
>> >> >>
>> >> >> com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
>> >> >> at
>> >> >>
>> >> >>
>> >> >> com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
>> >> >> at
>> >> >> org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
>> >> >> at
>> >> >>
>> >> >>
>> >> >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
>> >> >> at
>> >> >>
>> >> >>
>> >> >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
>> >> >> at
>> >> >>
>> >> >>
>> >> >> org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
>> >> >> at
>> >> >>
>> >> >>
>> >> >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
>> >> >> at
>> >> >>
>> >> >>
>> >> >> org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
>> >> >> at
>> >> >>
>> >> >>
>> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
>> >> >> ... 6 more
>> >> >> 16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException: OverrideExceptionHandler on thread ooc-io-0, msg = call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!, exiting...
>> >> >> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
>> >> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
>> >> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
>> >> >>     at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
>> >> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> >> >>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> >> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> >> >>     at java.lang.Thread.run(Thread.java:745)
>> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
>> >> >>     at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
>> >> >>     at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
>> >> >>     at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
>> >> >>     at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
>> >> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
>> >> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
>> >> >>     at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
>> >> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
>> >> >>     at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
>> >> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
>> >> >>     ... 6 more
>> >> >> 16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/giraph_yarn_application_1478342673283_0009/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir/datanode6.webmeup.com_5 on superstep 1
>> >> >>
>> >> >> We looked into one earlier thread,
>> >> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%3DoLJAAPQw%40mail.gmail.com%3E,
>> >> >> but it is rather old, and at that time the answer was "do not use it yet"
>> >> >> (see the reply at
>> >> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%40mail.gmail.com%3E).
>> >> >> Does that advice still hold today? We would like to use the new adaptive
>> >> >> out-of-core approach if possible.
>> >> >>
>> >> >> Thank you in advance; any help or hint would be really appreciated.
>> >> >>
>> >> >> Best Regards,
>> >> >> Denis Dudinski
>> >> >
>> >> >
>> >
>> >
>
>
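[Editor's note] For readers hitting the same combination of symptoms, the workarounds discussed in this thread (raising the mapper memory, enabling out-of-core, and lengthening the ZooKeeper session timeout so slow workers are not expired) can be sketched as a single launch command. This is a hedged sketch, not a verified invocation: the jar name, computation class, and input/output paths are placeholders, and the option names (`giraph.useOutOfCoreGraph`, `giraph.zkSessionMsecTimeout`) should be checked against the `GiraphConstants` of your Giraph build.

```shell
# Hypothetical Giraph 1.2 launch combining the remedies from this thread.
# Jar, class, and path names are illustrative placeholders.
hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
  -Dmapreduce.map.memory.mb=14848 \
  com.example.MyComputation \
  -vif com.example.MyVertexInputFormat \
  -vip /input/graph \
  -op /output/result \
  -w 20 \
  -ca giraph.useOutOfCoreGraph=true \
  -ca giraph.zkSessionMsecTimeout=3600000
```

Per Denis's report above, the key change is `giraph.zkSessionMsecTimeout`: raising it from the 600000 ms (10 minute) default to 3600000 ms (1 hour) gives heavily loaded workers time to heartbeat before the master treats them as dead. The partition-count options (`giraph.userPartitionCount`, `giraph.maxPartitionsInMemory`) are left out here because, as Hai notes earlier in the thread, the 1.2 out-of-core mechanism adapts them on its own.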

