giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hai Lan <lanhai1...@gmail.com>
Subject Re: Out of core computation fails with KryoException: Buffer underflow
Date Tue, 08 Nov 2016 13:43:28 GMT
Hi Guys

The OutOfMemoryError might be solved be adding
"-Dmapreduce.map.memory.mb=14848". But in my tests, I found some more
problems during running out of core graph.

I did two tests with 150G 10^10 vertices input in 1.2 version, and it seems
like it not necessary to add like "giraph.userPartitionCount=1000,
giraph.maxPartitionsInMemory=1" cause it is adaptive. However, If I run
without setting "userPartitionCount and maxPartitionsInMemory", it will it
will keep running on superstep -1 forever. *None of worker can finish
superstep -1. And I can see a warn in zookeeper log, not sure if it is the
problem:*

WARN [netty-client-worker-3]
org.apache.giraph.comm.netty.handler.ResponseClientHandler:
exceptionCaught: Channel failed with remote address
trantor21.umiacs.umd.edu/192.168.74.221:30172
java.lang.ArrayIndexOutOfBoundsException: 1075052544
	at org.apache.giraph.comm.flow_control.NoOpFlowControl.getAckSignalFlag(NoOpFlowControl.java:52)
	at org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:796)
	at org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87)
	at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
	at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
	at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
	at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
	at org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:74)
	at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
	at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
	at java.lang.Thread.run(Thread.java:745)



If I add giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1.
Whole command is :

hadoop jar /home/hlan/giraph-1.2.0-hadoop2/giraph-examples/
target/giraph-examples-1.2.0-hadoop2-for-hadoop-2.6.0-jar-with-dependencies.jar
org.apache.giraph.GiraphRunner -Dgiraph.useOutOfCoreGraph=true
-Ddigraph.block_factory_configurators=org.apache.giraph.conf.FacebookConfiguration
-Dmapreduce.map.memory.mb=14848 org.apache.giraph.examples.myTask -vif
org.apache.giraph.examples.LongFloatNullTextInputFormat -vip
/user/hlan/cube/tmp/out/ -vof
org.apache.giraph.io.formats.IdWithValueTextOutputFormat
-op /user/hlan/output -w 199 -ca mapred.job.tracker=localhost:
5431,steps=6,giraph.isStaticGraph=true,giraph.numInputThreads=10,giraph.
userPartitionCount=1000,giraph.maxPartitionsInMemory=1


*the job will be pass superstep -1 very quick (around 10 mins). But it will
be killed near end of superstep 0.*

2016-10-27 18:53:56,607 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Vertices
- Mean: 9810049, Min: Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=
trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) - 9771533, Max:
Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu,
MRtaskID=49, port=30049) - 9995724
2016-10-27 18:53:56,608 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Edges -
Mean: 0, Min: Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=
trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) - 0, Max:
Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu,
MRtaskID=49, port=30049) - 0
2016-10-27 18:53:56,634 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 18:54:26,638 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 18:54:56,640 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 18:55:26,641 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 18:55:56,642 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 18:56:26,643 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 18:56:56,644 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 18:57:26,645 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 18:57:56,646 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 18:58:26,647 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 18:58:56,675 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 18:59:26,676 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 18:59:56,677 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 19:00:26,678 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 19:00:56,679 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 19:01:26,680 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 19:01:29,610 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x158084f5b2100c6, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2016-10-27 19:01:29,612 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
client /192.168.74.212:53136 which had sessionid 0x158084f5b2100c6
2016-10-27 19:01:31,702 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket
connection from /192.168.74.212:56696
2016-10-27 19:01:31,711 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew
session 0x158084f5b2100c6 at /192.168.74.212:56696
2016-10-27 19:01:31,712 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.ZooKeeperServer: Established session
0x158084f5b2100c6 with negotiated timeout 600000 for client /
192.168.74.212:56696
2016-10-27 19:01:56,681 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 19:02:20,029 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x158084f5b2100c5, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2016-10-27 19:02:20,030 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
client /192.168.74.212:53134 which had sessionid 0x158084f5b2100c5
2016-10-27 19:02:21,584 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket
connection from /192.168.74.212:56718
2016-10-27 19:02:21,608 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew
session 0x158084f5b2100c5 at /192.168.74.212:56718
2016-10-27 19:02:21,608 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.ZooKeeperServer: Established session
0x158084f5b2100c5 with negotiated timeout 600000 for client /
192.168.74.212:56718
2016-10-27 19:02:26,682 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 19:02:56,683 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 19:03:05,743 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x158084f5b2100b9, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2016-10-27 19:03:05,744 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
client /192.168.74.203:51130 which had sessionid 0x158084f5b2100b9
2016-10-27 19:03:07,452 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket
connection from /192.168.74.203:54676
2016-10-27 19:03:07,493 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew
session 0x158084f5b2100b9 at /192.168.74.203:54676
2016-10-27 19:03:07,494 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.ZooKeeperServer: Established session
0x158084f5b2100b9 with negotiated timeout 600000 for client /
192.168.74.203:54676
2016-10-27 19:03:26,684 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 19:03:53,712 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x158084f5b2100be, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2016-10-27 19:03:53,713 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
client /192.168.74.203:51146 which had sessionid 0x158084f5b2100be
2016-10-27 19:03:55,436 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket
connection from /192.168.74.203:54694
2016-10-27 19:03:55,482 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew
session 0x158084f5b2100be at /192.168.74.203:54694
2016-10-27 19:03:55,483 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
org.apache.zookeeper.server.ZooKeeperServer: Established session
0x158084f5b2100be with negotiated timeout 600000 for client /
192.168.74.203:54694
2016-10-27 19:03:56,719 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of
199 workers finished on superstep 0 on path
/_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 19:04:00,000 INFO [SessionTracker]
org.apache.zookeeper.server.ZooKeeperServer: Expiring session
0x158084f5b2100b8, timeout of 600000ms exceeded
2016-10-27 19:04:00,001 INFO [SessionTracker]
org.apache.zookeeper.server.ZooKeeperServer: Expiring session
0x158084f5b2100c2, timeout of 600000ms exceeded
2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):]
org.apache.zookeeper.server.PrepRequestProcessor: Processed session
termination for sessionid: 0x158084f5b2100b8
2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):]
org.apache.zookeeper.server.PrepRequestProcessor: Processed session
termination for sessionid: 0x158084f5b2100c2
2016-10-27 19:04:00,004 INFO [SyncThread:0]
org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
client /192.168.74.203:51116 which had sessionid 0x158084f5b2100b8
2016-10-27 19:04:00,006 INFO [SyncThread:0]
org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
client /192.168.74.212:53128 which had sessionid 0x158084f5b2100c2
2016-10-27 19:04:00,033 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: setJobState:
{"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1} on
superstep 0

Any Idea about this?

Thanks,

Hai


On Tue, Nov 8, 2016 at 6:37 AM, Denis Dudinski <denis.dudinski@gmail.com>
wrote:

> Hi Xenia,
>
> Thank you! I'll check the thread you mentioned.
>
> Best Regards,
> Denis Dudinski
>
> 2016-11-08 14:16 GMT+03:00 Xenia Demetriou <xeniad20@gmail.com>:
> > Hi Denis,
> >
> > For the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error  I
> > hope that the  conversation in below link can help you.
> >  www.mail-archive.com/user@giraph.apache.org/msg02938.html
> >
> > Regards,
> > Xenia
> >
> > 2016-11-08 12:25 GMT+02:00 Denis Dudinski <denis.dudinski@gmail.com>:
> >>
> >> Hi Hassan,
> >>
> >> Thank you for really quick response!
> >>
> >> I changed "giraph.isStaticGraph" to false and the error disappeared.
> >> As expected iteration became slow and wrote to disk edges once again
> >> in superstep 1.
> >>
> >> However, the computation failed at superstep 2 with error
> >> "java.lang.OutOfMemoryError: GC overhead limit exceeded". It seems to
> >> be unrelated to "isStaticGraph" issue, but I think it worth mentioning
> >> to see the picture as a whole.
> >>
> >> Are there any other tests/information I am able to execute/check to
> >> help to pinpoint "isStaticGraph" problem?
> >>
> >> Best Regards,
> >> Denis Dudinski
> >>
> >>
> >> 2016-11-07 20:00 GMT+03:00 Hassan Eslami <hsn.eslami@gmail.com>:
> >> > Hi Denis,
> >> >
> >> > Thanks for bringing up the issue. In the previous conversation thread,
> >> > the
> >> > similar problem is reported even with a simpler example connected
> >> > component
> >> > calculation. Although, back then, we were developing other
> >> > performance-critical components of OOC.
> >> >
> >> > Let's debug this issue together to make the new OOC more stable. I
> >> > suspect
> >> > the problem is with "giraph.isStaticGraph=true" (as this is only an
> >> > optimization and most of our end-to-end testing was on cases where the
> >> > graph
> >> > could change). Let's get rid of it for now and see if the problem
> still
> >> > exists.
> >> >
> >> > Best,
> >> > Hassan
> >> >
> >> > On Mon, Nov 7, 2016 at 6:24 AM, Denis Dudinski
> >> > <denis.dudinski@gmail.com>
> >> > wrote:
> >> >>
> >> >> Hello,
> >> >>
> >> >> We are trying to calculate PageRank on huge graph, which does not fit
> >> >> into memory. For calculation to succeed we tried to turn on OutOfCore
> >> >> feature of Giraph, but every launch we tried resulted in
> >> >> com.esotericsoftware.kryo.KryoException: Buffer underflow.
> >> >> Each time it happens on different servers but exactly right after
> >> >> start of superstep 1.
> >> >>
> >> >> We are using Giraph 1.2.0 on hadoop 2.7.3 (our prod version, can't
> >> >> back-step to Giraph's officially supported version and had to patch
> >> >> Giraph a little) placed on 11 servers + 3 master servers (namenodes
> >> >> etc.) with separate ZooKeeper cluster deployment.
> >> >>
> >> >> Our launch command:
> >> >>
> >> >> hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar
> >> >> org.apache.giraph.GiraphRunner com.prototype.di.pr.
> PageRankComputation
> >> >> \
> >> >> -mc com.prototype.di.pr.PageRankMasterCompute \
> >> >> -yj pr-job-jar-with-dependencies.jar \
> >> >> -vif com.belprime.di.pr.input.HBLongVertexInputFormat \
> >> >> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
> >> >> -op /user/hadoop/output/pr_test \
> >> >> -w 10 \
> >> >> -c com.prototype.di.pr.PRDoubleCombiner \
> >> >> -wc com.prototype.di.pr.PageRankWorkerContext \
> >> >> -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
> >> >> -ca giraph.logLevel=info \
> >> >> -ca hbase.mapreduce.inputtable=di_test \
> >> >> -ca hbase.mapreduce.scan.columns=di:n \
> >> >> -ca hbase.defaults.for.version.skip=true \
> >> >> -ca hbase.table.row.textkey=false \
> >> >> -ca giraph.yarn.task.heap.mb=48000 \
> >> >> -ca giraph.isStaticGraph=true \
> >> >> -ca giraph.SplitMasterWorker=false \
> >> >> -ca giraph.oneToAllMsgSending=true \
> >> >> -ca giraph.metrics.enable=true \
> >> >> -ca giraph.jmap.histo.enable=true \
> >> >> -ca
> >> >> giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable
> \
> >> >> -ca
> >> >> giraph.outgoingMessageValueClass=org.apache.hadoop.io.DoubleWritable
> \
> >> >> -ca giraph.inputOutEdgesClass=org.apache.giraph.edge.
> LongNullArrayEdges
> >> >> \
> >> >> -ca giraph.useOutOfCoreGraph=true \
> >> >> -ca giraph.waitForPerWorkerRequests=true \
> >> >> -ca giraph.maxNumberOfUnsentRequests=1000 \
> >> >> -ca
> >> >>
> >> >> giraph.vertexInputFilterClass=com.prototype.di.pr.input.
> PagesFromSameDomainLimiter
> >> >> \
> >> >> -ca giraph.useInputSplitLocality=true \
> >> >> -ca hbase.mapreduce.scan.cachedrows=10000 \
> >> >> -ca giraph.minPartitionsPerComputeThread=60 \
> >> >> -ca
> >> >>
> >> >> giraph.graphPartitionerFactoryClass=com.prototype.di.pr.
> DomainAwareGraphPartitionerFactory
> >> >> \
> >> >> -ca giraph.numInputThreads=1 \
> >> >> -ca giraph.inputSplitSamplePercent=20 \
> >> >> -ca giraph.pr.maxNeighborsPerVertex=50 \
> >> >> -ca
> >> >> giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition
> \
> >> >> -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \
> >> >> -ca
> >> >>
> >> >> giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/
> _bsp/_partitions
> >> >>
> >> >> Logs excerpt:
> >> >>
> >> >> 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in
> >> >> worker
> >> >> context
> >> >> 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60 partitions
> >> >> to process with 1 compute thread(s), originally 1 thread(s) on
> >> >> superstep 1
> >> >> 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60
> >> >> partitions in memory and 1 active threads
> >> >> 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR
> >> >> computation
> >> >> 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle: getNextIOActions:
> >> >> usedMemoryFraction = 0.75
> >> >> 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine:
> >> >> updateActiveThreadsFraction: updating the number of active threads
to
> >> >> 1
> >> >> 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle:
> >> >> updateRequestsCredit: updating the credit to 20
> >> >> 16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring:
> >> >> name = PS Scavenge, action = end of minor GC, cause = Allocation
> >> >> Failure, duration = 937ms
> >> >> 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle: getNextIOActions:
> >> >> usedMemoryFraction = 0.72
> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: getNextIOActions:
> >> >> usedMemoryFraction = 0.74
> >> >> 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine:
> >> >> updateActiveThreadsFraction: updating the number of active threads
to
> >> >> 1
> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle:
> >> >> updateRequestsCredit: updating the credit to 20
> >> >> 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle: getNextIOActions:
> >> >> usedMemoryFraction = 0.76
> >> >> 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition:
> >> >> processing partition 234 is done!
> >> >> 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle: getNextIOActions:
> >> >> usedMemoryFraction = 0.79
> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine:
> >> >> updateActiveThreadsFraction: updating the number of active threads
to
> >> >> 1
> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle:
> >> >> updateRequestsCredit: updating the credit to 18
> >> >> 16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window
> >> >> metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave
> >> >> received req MBytes = 0.0138, secs waited = 30.058
> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions:
> >> >> usedMemoryFraction = 0.82
> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next
> >> >> IO command is: StorePartitionIOCommand: (partitionId = 234)
> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's
> >> >> command StorePartitionIOCommand: (partitionId = 234) completed:
> bytes=
> >> >> 64419740, duration=351, bandwidth=175.03, bandwidth (excluding GC
> >> >> time)=175.03
> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions:
> >> >> usedMemoryFraction = 0.83
> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next
> >> >> IO command is: StoreIncomingMessageIOCommand: (partitionId = 234)
> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's
> >> >> command StoreIncomingMessageIOCommand: (partitionId = 234) completed:
> >> >> bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC
> time)=NaN
> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions:
> >> >> usedMemoryFraction = 0.83
> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring:
> >> >> name = PS Scavenge, action = end of minor GC, cause = Allocation
> >> >> Failure, duration = 3107ms
> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring:
> >> >> name = PS MarkSweep, action = end of major GC, cause = Ergonomics,
> >> >> duration = 15064ms
> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine:
> >> >> updateActiveThreadsFraction: updating the number of active threads
to
> >> >> 1
> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle:
> >> >> updateRequestsCredit: updating the credit to 20
> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: getNextIOActions:
> >> >> usedMemoryFraction = 0.71
> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's next
> >> >> IO command is: LoadPartitionIOCommand: (partitionId = 234, superstep
> =
> >> >> 2)
> >> >> JMap histo dump at Sun Nov 06 15:47:41 CET 2016
> >> >> 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition:
> >> >> processing partition 364 is done!
> >> >> 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine:
> >> >> updateActiveThreadsFraction: updating the number of active threads
to
> >> >> 1
> >> >> 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle:
> >> >> updateRequestsCredit: updating the credit to 20
> >> >> --
> >> >> -- num     #instances         #bytes  class name
> >> >> -- ----------------------------------------------
> >> >> --   1:     224004229    10752202992
> >> >> java.util.concurrent.ConcurrentHashMap$Node
> >> >> --   2:      19751666     6645730528  [B
> >> >> --   3:     222135985     5331263640
> >> >> com.belprime.di.pr.DomainPartAwareLongWritable
> >> >> --   4:     214686483     5152475592
> >> >> org.apache.hadoop.io.DoubleWritable
> >> >> --   5:           353     4357261784
> >> >> [Ljava.util.concurrent.ConcurrentHashMap$Node;
> >> >> --   6:        486266      204484688  [I
> >> >> --   7:       6017652      192564864
> >> >> org.apache.giraph.utils.UnsafeByteArrayOutputStream
> >> >> --   8:       3986203      159448120
> >> >> org.apache.giraph.utils.UnsafeByteArrayInputStream
> >> >> --   9:       2064182      148621104
> >> >> org.apache.giraph.graph.ByteValueVertex
> >> >> --  10:       2064182       82567280
> >> >> org.apache.giraph.edge.ByteArrayEdges
> >> >> --  11:       1886875       45285000  java.lang.Integer
> >> >> --  12:        349409       30747992
> >> >> java.util.concurrent.ConcurrentHashMap$TreeNode
> >> >> --  13:        916970       29343040  java.util.Collections$1
> >> >> --  14:        916971       22007304
> >> >> java.util.Collections$SingletonSet
> >> >> --  15:         47270        3781600
> >> >> java.util.concurrent.ConcurrentHashMap$TreeBin
> >> >> --  16:         26201        2590912  [C
> >> >> --  17:         34175        1367000
> >> >> org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
> >> >> --  18:          6143        1067704  java.lang.Class
> >> >> --  19:         25953         830496  java.lang.String
> >> >> --  20:         34175         820200
> >> >> org.apache.giraph.edge.EdgeNoValue
> >> >> --  21:          4488         703400  [Ljava.lang.Object;
> >> >> --  22:            70         395424  [Ljava.nio.channels.
> SelectionKey;
> >> >> --  23:          2052         328320  java.lang.reflect.Method
> >> >> --  24:          6600         316800
> >> >> org.apache.giraph.utils.ByteArrayVertexIdMessages
> >> >> --  25:          5781         277488  java.util.HashMap$Node
> >> >> --  26:          5651         271248  java.util.Hashtable$Entry
> >> >> --  27:          6604         211328
> >> >> org.apache.giraph.factories.DefaultMessageValueFactory
> >> >> 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of
> >> >> callable failed
> >> >> java.lang.RuntimeException: call: execution of IO command
> >> >> LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(
> OutOfCoreIOCallable.java:115)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(
> OutOfCoreIOCallable.java:36)
> >> >> at
> >> >>
> >> >> org.apache.giraph.utils.LogStacktraceCallable.call(
> LogStacktraceCallable.java:67)
> >> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> >> at
> >> >>
> >> >> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> >> >> at
> >> >>
> >> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> >> >> at java.lang.Thread.run(Thread.java:745)
> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer
> underflow.
> >> >> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> >> >> at
> >> >> com.esotericsoftware.kryo.io.UnsafeInput.readLong(
> UnsafeInput.java:112)
> >> >> at
> >> >>
> >> >> com.esotericsoftware.kryo.io.KryoDataInput.readLong(
> KryoDataInput.java:91)
> >> >> at org.apache.hadoop.io.LongWritable.readFields(
> LongWritable.java:47)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(
> DiskBackedPartitionStore.java:245)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.
> loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.data.DiskBackedDataStore.
> loadPartitionDataProxy(DiskBackedDataStore.java:234)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.
> loadPartitionData(DiskBackedPartitionStore.java:311)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(
> LoadPartitionIOCommand.java:66)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(
> OutOfCoreIOCallable.java:99)
> >> >> ... 6 more
> >> >> 16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException:
> >> >> OverrideExceptionHandler on thread ooc-io-0, msg = call: execution
of
> >> >> IO command LoadPartitionIOCommand: (partitionId = 234, superstep =
2)
> >> >> failed!, exiting...
> >> >> java.lang.RuntimeException: call: execution of IO command
> >> >> LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(
> OutOfCoreIOCallable.java:115)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(
> OutOfCoreIOCallable.java:36)
> >> >> at
> >> >>
> >> >> org.apache.giraph.utils.LogStacktraceCallable.call(
> LogStacktraceCallable.java:67)
> >> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> >> at
> >> >>
> >> >> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> >> >> at
> >> >>
> >> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> >> >> at java.lang.Thread.run(Thread.java:745)
> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer
> underflow.
> >> >> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> >> >> at
> >> >> com.esotericsoftware.kryo.io.UnsafeInput.readLong(
> UnsafeInput.java:112)
> >> >> at
> >> >>
> >> >> com.esotericsoftware.kryo.io.KryoDataInput.readLong(
> KryoDataInput.java:91)
> >> >> at org.apache.hadoop.io.LongWritable.readFields(
> LongWritable.java:47)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(
> DiskBackedPartitionStore.java:245)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.
> loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.data.DiskBackedDataStore.
> loadPartitionDataProxy(DiskBackedDataStore.java:234)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.
> loadPartitionData(DiskBackedPartitionStore.java:311)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(
> LoadPartitionIOCommand.java:66)
> >> >> at
> >> >>
> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(
> OutOfCoreIOCallable.java:99)
> >> >> ... 6 more
> >> >> 16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth:
> Got
> >> >> failure, unregistering health on
> >> >>
> >> >>
> >> >> /_hadoopBsp/giraph_yarn_application_1478342673283_
> 0009/_applicationAttemptsDir/0/_superstepDir/1/_
> workerHealthyDir/datanode6.webmeup.com_5
> >> >> on superstep 1
> >> >>
> >> >> We looked into one thread
> >> >>
> >> >>
> >> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%
> 3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%
> 3DoLJAAPQw%40mail.gmail.com%3E
> >> >> but it is rather old and at that time the answer was "do not use it
> >> >> yet".
> >> >> (see reply
> >> >>
> >> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%
> 3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%
> 40mail.gmail.com%3E).
> >> >> Does it hold today? We would like to use new advanced adaptive OOC
> >> >> approach if possible...
> >> >>
> >> >> Thank you in advance, any help or hint would be really appreciated.
> >> >>
> >> >> Best Regards,
> >> >> Denis Dudinski
> >> >
> >> >
> >
> >
>

Mime
View raw message