giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hassan Eslami <hsn.esl...@gmail.com>
Subject Re: Out of core computation fails with KryoException: Buffer underflow
Date Mon, 07 Nov 2016 17:00:15 GMT
Hi Denis,

Thanks for bringing up the issue. In the previous conversation thread, the
similar problem is reported even with a simpler example connected component
calculation. Although, back then, we were developing other
performance-critical components of OOC.

Let's debug this issue together to make the new OOC more stable. I suspect
the problem is with "giraph.isStaticGraph=true" (as this is only an
optimization and most of our end-to-end testing was on cases where the
graph could change). Let's get rid of it for now and see if the problem
still exists.

Best,
Hassan

On Mon, Nov 7, 2016 at 6:24 AM, Denis Dudinski <denis.dudinski@gmail.com>
wrote:

> Hello,
>
> We are trying to calculate PageRank on huge graph, which does not fit
> into memory. For calculation to succeed we tried to turn on OutOfCore
> feature of Giraph, but every launch we tried resulted in
> com.esotericsoftware.kryo.KryoException: Buffer underflow.
> Each time it happens on different servers but exactly right after
> start of superstep 1.
>
> We are using Giraph 1.2.0 on hadoop 2.7.3 (our prod version, can't
> back-step to Giraph's officially supported version and had to patch
> Giraph a little) placed on 11 servers + 3 master servers (namenodes
> etc.) with separate ZooKeeper cluster deployment.
>
> Our launch command:
>
> hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar
> org.apache.giraph.GiraphRunner com.prototype.di.pr.PageRankComputation
> \
> -mc com.prototype.di.pr.PageRankMasterCompute \
> -yj pr-job-jar-with-dependencies.jar \
> -vif com.belprime.di.pr.input.HBLongVertexInputFormat \
> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
> -op /user/hadoop/output/pr_test \
> -w 10 \
> -c com.prototype.di.pr.PRDoubleCombiner \
> -wc com.prototype.di.pr.PageRankWorkerContext \
> -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
> -ca giraph.logLevel=info \
> -ca hbase.mapreduce.inputtable=di_test \
> -ca hbase.mapreduce.scan.columns=di:n \
> -ca hbase.defaults.for.version.skip=true \
> -ca hbase.table.row.textkey=false \
> -ca giraph.yarn.task.heap.mb=48000 \
> -ca giraph.isStaticGraph=true \
> -ca giraph.SplitMasterWorker=false \
> -ca giraph.oneToAllMsgSending=true \
> -ca giraph.metrics.enable=true \
> -ca giraph.jmap.histo.enable=true \
> -ca giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable \
> -ca giraph.outgoingMessageValueClass=org.apache.hadoop.io.DoubleWritable \
> -ca giraph.inputOutEdgesClass=org.apache.giraph.edge.LongNullArrayEdges \
> -ca giraph.useOutOfCoreGraph=true \
> -ca giraph.waitForPerWorkerRequests=true \
> -ca giraph.maxNumberOfUnsentRequests=1000 \
> -ca giraph.vertexInputFilterClass=com.prototype.di.pr.input.
> PagesFromSameDomainLimiter
> \
> -ca giraph.useInputSplitLocality=true \
> -ca hbase.mapreduce.scan.cachedrows=10000 \
> -ca giraph.minPartitionsPerComputeThread=60 \
> -ca giraph.graphPartitionerFactoryClass=com.prototype.di.pr.
> DomainAwareGraphPartitionerFactory
> \
> -ca giraph.numInputThreads=1 \
> -ca giraph.inputSplitSamplePercent=20 \
> -ca giraph.pr.maxNeighborsPerVertex=50 \
> -ca giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition \
> -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \
> -ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/
> _bsp/_partitions
>
> Logs excerpt:
>
> 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in worker
> context
> 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60 partitions
> to process with 1 compute thread(s), originally 1 thread(s) on
> superstep 1
> 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60
> partitions in memory and 1 active threads
> 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR
> computation
> 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle: getNextIOActions:
> usedMemoryFraction = 0.75
> 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine:
> updateActiveThreadsFraction: updating the number of active threads to
> 1
> 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle:
> updateRequestsCredit: updating the credit to 20
> 16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring:
> name = PS Scavenge, action = end of minor GC, cause = Allocation
> Failure, duration = 937ms
> 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle: getNextIOActions:
> usedMemoryFraction = 0.72
> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: getNextIOActions:
> usedMemoryFraction = 0.74
> 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine:
> updateActiveThreadsFraction: updating the number of active threads to
> 1
> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle:
> updateRequestsCredit: updating the credit to 20
> 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle: getNextIOActions:
> usedMemoryFraction = 0.76
> 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition:
> processing partition 234 is done!
> 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle: getNextIOActions:
> usedMemoryFraction = 0.79
> 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine:
> updateActiveThreadsFraction: updating the number of active threads to
> 1
> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle:
> updateRequestsCredit: updating the credit to 18
> 16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window
> metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave
> received req MBytes = 0.0138, secs waited = 30.058
> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions:
> usedMemoryFraction = 0.82
> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next
> IO command is: StorePartitionIOCommand: (partitionId = 234)
> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's
> command StorePartitionIOCommand: (partitionId = 234) completed: bytes=
> 64419740, duration=351, bandwidth=175.03, bandwidth (excluding GC
> time)=175.03
> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions:
> usedMemoryFraction = 0.83
> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next
> IO command is: StoreIncomingMessageIOCommand: (partitionId = 234)
> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's
> command StoreIncomingMessageIOCommand: (partitionId = 234) completed:
> bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC time)=NaN
> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions:
> usedMemoryFraction = 0.83
> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring:
> name = PS Scavenge, action = end of minor GC, cause = Allocation
> Failure, duration = 3107ms
> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring:
> name = PS MarkSweep, action = end of major GC, cause = Ergonomics,
> duration = 15064ms
> 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine:
> updateActiveThreadsFraction: updating the number of active threads to
> 1
> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle:
> updateRequestsCredit: updating the credit to 20
> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: getNextIOActions:
> usedMemoryFraction = 0.71
> 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's next
> IO command is: LoadPartitionIOCommand: (partitionId = 234, superstep =
> 2)
> JMap histo dump at Sun Nov 06 15:47:41 CET 2016
> 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition:
> processing partition 364 is done!
> 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine:
> updateActiveThreadsFraction: updating the number of active threads to
> 1
> 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle:
> updateRequestsCredit: updating the credit to 20
> --
> -- num     #instances         #bytes  class name
> -- ----------------------------------------------
> --   1:     224004229    10752202992
> java.util.concurrent.ConcurrentHashMap$Node
> --   2:      19751666     6645730528  [B
> --   3:     222135985     5331263640
> com.belprime.di.pr.DomainPartAwareLongWritable
> --   4:     214686483     5152475592  org.apache.hadoop.io.DoubleWritable
> --   5:           353     4357261784
> [Ljava.util.concurrent.ConcurrentHashMap$Node;
> --   6:        486266      204484688  [I
> --   7:       6017652      192564864
> org.apache.giraph.utils.UnsafeByteArrayOutputStream
> --   8:       3986203      159448120
> org.apache.giraph.utils.UnsafeByteArrayInputStream
> --   9:       2064182      148621104  org.apache.giraph.graph.
> ByteValueVertex
> --  10:       2064182       82567280  org.apache.giraph.edge.
> ByteArrayEdges
> --  11:       1886875       45285000  java.lang.Integer
> --  12:        349409       30747992
> java.util.concurrent.ConcurrentHashMap$TreeNode
> --  13:        916970       29343040  java.util.Collections$1
> --  14:        916971       22007304  java.util.Collections$SingletonSet
> --  15:         47270        3781600
> java.util.concurrent.ConcurrentHashMap$TreeBin
> --  16:         26201        2590912  [C
> --  17:         34175        1367000
> org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
> --  18:          6143        1067704  java.lang.Class
> --  19:         25953         830496  java.lang.String
> --  20:         34175         820200  org.apache.giraph.edge.EdgeNoValue
> --  21:          4488         703400  [Ljava.lang.Object;
> --  22:            70         395424  [Ljava.nio.channels.SelectionKey;
> --  23:          2052         328320  java.lang.reflect.Method
> --  24:          6600         316800
> org.apache.giraph.utils.ByteArrayVertexIdMessages
> --  25:          5781         277488  java.util.HashMap$Node
> --  26:          5651         271248  java.util.Hashtable$Entry
> --  27:          6604         211328
> org.apache.giraph.factories.DefaultMessageValueFactory
> 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of
> callable failed
> java.lang.RuntimeException: call: execution of IO command
> LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(
> OutOfCoreIOCallable.java:115)
> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(
> OutOfCoreIOCallable.java:36)
> at org.apache.giraph.utils.LogStacktraceCallable.call(
> LogStacktraceCallable.java:67)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
> at com.esotericsoftware.kryo.io.KryoDataInput.readLong(
> KryoDataInput.java:91)
> at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(
> DiskBackedPartitionStore.java:245)
> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.
> loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(
> DiskBackedDataStore.java:234)
> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(
> DiskBackedPartitionStore.java:311)
> at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(
> LoadPartitionIOCommand.java:66)
> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(
> OutOfCoreIOCallable.java:99)
> ... 6 more
> 16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException:
> OverrideExceptionHandler on thread ooc-io-0, msg = call: execution of
> IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
> failed!, exiting...
> java.lang.RuntimeException: call: execution of IO command
> LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(
> OutOfCoreIOCallable.java:115)
> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(
> OutOfCoreIOCallable.java:36)
> at org.apache.giraph.utils.LogStacktraceCallable.call(
> LogStacktraceCallable.java:67)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
> at com.esotericsoftware.kryo.io.KryoDataInput.readLong(
> KryoDataInput.java:91)
> at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(
> DiskBackedPartitionStore.java:245)
> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.
> loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(
> DiskBackedDataStore.java:234)
> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(
> DiskBackedPartitionStore.java:311)
> at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(
> LoadPartitionIOCommand.java:66)
> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(
> OutOfCoreIOCallable.java:99)
> ... 6 more
> 16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth: Got
> failure, unregistering health on
> /_hadoopBsp/giraph_yarn_application_1478342673283_
> 0009/_applicationAttemptsDir/0/_superstepDir/1/_
> workerHealthyDir/datanode6.webmeup.com_5
> on superstep 1
>
> We looked into one thread
> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%
> 3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%
> 3DoLJAAPQw%40mail.gmail.com%3E
> but it is rather old and at that time the answer was "do not use it
> yet".
> (see reply http://mail-archives.apache.org/mod_mbox/giraph-user/
> 201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%
> 40mail.gmail.com%3E).
> Does it hold today? We would like to use new advanced adaptive OOC
> approach if possible...
>
> Thank you in advance, any help or hint would be really appreciated.
>
> Best Regards,
> Denis Dudinski
>

Mime
View raw message