giraph-user mailing list archives

From Hai Lan <lanhai1...@gmail.com>
Subject Re: Out of core computation fails with KryoException: Buffer underflow
Date Tue, 08 Nov 2016 17:01:17 GMT
Hello Denis,

Thanks for your quick response.

I just tested setting the timeout to 3600000 ms, and superstep 0 can now
be finished. However, the job is killed immediately when superstep 1
starts (a sketch of how such a timeout can be passed is after the log).
In the ZooKeeper log:

2016-11-08 11:54:13,569 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: checkWorkers: Only found
198 responses of 199 needed to start superstep 1.  Reporting every
30000 msecs, 511036 more msecs left before giving up.
2016-11-08 11:54:13,570 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster:
logMissingWorkersOnSuperstep: No response from partition 13 (could be
master)
2016-11-08 11:54:13,571 INFO [ProcessThread(sid:0 cport:-1):]
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x15844b61ba30000
type:create cxid:0x14e81 zxid:0xc76 txntype:-1 reqpath:n/a Error
Path:/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir
Error:KeeperErrorCode = NodeExists for
/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir
2016-11-08 11:54:13,571 INFO [ProcessThread(sid:0 cport:-1):]
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x15844b61ba30000
type:create cxid:0x14e82 zxid:0xc77 txntype:-1 reqpath:n/a Error
Path:/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerUnhealthyDir
Error:KeeperErrorCode = NodeExists for
/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerUnhealthyDir
2016-11-08 11:54:21,045 INFO [ProcessThread(sid:0 cport:-1):]
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x15844b61ba30000
type:create cxid:0x14f4b zxid:0xc79 txntype:-1 reqpath:n/a Error
Path:/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir
Error:KeeperErrorCode = NodeExists for
/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir
2016-11-08 11:54:21,046 INFO [ProcessThread(sid:0 cport:-1):]
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x15844b61ba30000
type:create cxid:0x14f4c zxid:0xc7a txntype:-1 reqpath:n/a Error
Path:/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerUnhealthyDir
Error:KeeperErrorCode = NodeExists for
/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerUnhealthyDir
2016-11-08 11:54:21,094 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.comm.netty.NettyClient: connectAllAddresses:
Successfully added 0 connections, (0 total connected) 0 failed, 0
failures total.
2016-11-08 11:54:21,095 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.partition.PartitionBalancer:
balancePartitionsAcrossWorkers: Using algorithm static
2016-11-08 11:54:21,097 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats:
[Worker(hostname=trantor20.umiacs.umd.edu
hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=22,
port=30022):(v=48825003, e=0),Worker(hostname=trantor08.umiacs.umd.edu
hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=51,
port=30051):(v=48825003, e=0),Worker(hostname=trantor17.umiacs.umd.edu
hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=87,
port=30087):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu
hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=99,
port=30099):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu
hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=159,
port=30159):(v=48824999, e=0),Worker(hostname=trantor12.umiacs.umd.edu
hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=189,
port=30189):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu
hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=166,
port=30166):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu
hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=172,
port=30172):(v=48824999, e=0),Worker(hostname=trantor04.umiacs.umd.edu
hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=195,
port=30195):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu
hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=116,
port=30116):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu
hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=154,
port=30154):(v=48824999, e=0),Worker(hostname=trantor04.umiacs.umd.edu
hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=2,
port=30002):(v=58590001, e=0),Worker(hostname=trantor15.umiacs.umd.edu
hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=123,
port=30123):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu
hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=52,
port=30052):(v=48825001, e=0),Worker(hostname=trantor12.umiacs.umd.edu
hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=188,
port=30188):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu
hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=165,
port=30165):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu
hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=171,
port=30171):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu
hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=23,
port=30023):(v=48825003, e=0),Worker(hostname=trantor23.umiacs.umd.edu
hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=117,
port=30117):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu
hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=20,
port=30020):(v=48825003, e=0),Worker(hostname=trantor17.umiacs.umd.edu
hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=89,
port=30089):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu
hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=53,
port=30053):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu
hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=168,
port=30168):(v=48824999, e=0),Worker(hostname=trantor12.umiacs.umd.edu
hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=187,
port=30187):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu
hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=179,
port=30179):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu
hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=118,
port=30118):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu
hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=75,
port=30075):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu
hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=152,
port=30152):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu
hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=21,
port=30021):(v=48825003, e=0),Worker(hostname=trantor17.umiacs.umd.edu
hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=88,
port=30088):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu
hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=180,
port=30180):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu
hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=54,
port=30054):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu
hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=76,
port=30076):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu
hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=119,
port=30119):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu
hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=167,
port=30167):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu
hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=153,
port=30153):(v=48824999, e=0),Worker(hostname=trantor04.umiacs.umd.edu
hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=196,
port=30196):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu
hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=170,
port=30170):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu
hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=103,
port=30103):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu
hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=156,
port=30156):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu
hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=120,
port=30120):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu
hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=150,
port=30150):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu
hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=67,
port=30067):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu
hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=59,
port=30059):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu
hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=84,
port=30084):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu
hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=19,
port=30019):(v=48825003, e=0),Worker(hostname=trantor18.umiacs.umd.edu
hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=102,
port=30102):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu
hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=169,
port=30169):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu
hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=34,
port=30034):(v=48825003, e=0),Worker(hostname=trantor06.umiacs.umd.edu
hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=162,
port=30162):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu
hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=157,
port=30157):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu
hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=83,
port=30083):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu
hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=151,
port=30151):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu
hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=121,
port=30121):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu
hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=131,
port=30131):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu
hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=101,
port=30101):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu
hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=161,
port=30161):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu
hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=122,
port=30122):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu
hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=158,
port=30158):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu
hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=148,
port=30148):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu
hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=86,
port=30086):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu
hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=140,
port=30140):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu
hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=91,
port=30091):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu
hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=100,
port=30100):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu
hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=160,
port=30160):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu
hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=149,
port=30149):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu
hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=85,
port=30085):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu
hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=139,
port=30139):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu
hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=13,
port=30013):(v=48825003, e=0),Worker(hostname=trantor00.umiacs.umd.edu
hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=80,
port=30080):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu
hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=92,
port=30092):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu
hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=112,
port=30112):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu
hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=147,
port=30147):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu
hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=184,
port=30184):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu
hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=8,
port=30008):(v=48825004, e=0),Worker(hostname=trantor02.umiacs.umd.edu
hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=45,
port=30045):(v=48825003, e=0),Worker(hostname=trantor08.umiacs.umd.edu
hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=58,
port=30058):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu
hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=32,
port=30032):(v=48825003, e=0),Worker(hostname=trantor18.umiacs.umd.edu
hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=106,
port=30106):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu
hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=63,
port=30063):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu
hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=142,
port=30142):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu
hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=71,
port=30071):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu
hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=40,
port=30040):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu
hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=130,
port=30130):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu
hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=39,
port=30039):(v=48825003, e=0),Worker(hostname=trantor16.umiacs.umd.edu
hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=111,
port=30111):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu
hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=93,
port=30093):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu
hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=14,
port=30014):(v=48825003, e=0),Worker(hostname=trantor11.umiacs.umd.edu
hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=7,
port=30007):(v=48825004, e=0),Worker(hostname=trantor05.umiacs.umd.edu
hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=185,
port=30185):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu
hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=46,
port=30046):(v=48825003, e=0),Worker(hostname=trantor07.umiacs.umd.edu
hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=33,
port=30033):(v=48825003, e=0),Worker(hostname=trantor18.umiacs.umd.edu
hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=105,
port=30105):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu
hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=62,
port=30062):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu
hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=141,
port=30141):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu
hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=70,
port=30070):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu
hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=135,
port=30135):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu
hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=186,
port=30186):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu
hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=94,
port=30094):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu
hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=114,
port=30114):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu
hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=43,
port=30043):(v=48825003, e=0),Worker(hostname=trantor07.umiacs.umd.edu
hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=30,
port=30030):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu
hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=128,
port=30128):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu
hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=144,
port=30144):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu
hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=69,
port=30069):(v=48824999, e=0),Worker(hostname=trantor12.umiacs.umd.edu
hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=194,
port=30194):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu
hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10,
port=30010):(v=48825004, e=0),Worker(hostname=trantor19.umiacs.umd.edu
hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=132,
port=30132):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu
hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=104,
port=30104):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu
hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=61,
port=30061):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu
hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=11,
port=30011):(v=48825003, e=0),Worker(hostname=trantor10.umiacs.umd.edu
hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=42,
port=30042):(v=48825003, e=0),Worker(hostname=trantor13.umiacs.umd.edu
hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=178,
port=30178):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu
hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=12,
port=30012):(v=48825003, e=0),Worker(hostname=trantor19.umiacs.umd.edu
hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=134,
port=30134):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu
hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=113,
port=30113):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu
hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=44,
port=30044):(v=48825003, e=0),Worker(hostname=trantor06.umiacs.umd.edu
hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=155,
port=30155):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu
hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=31,
port=30031):(v=48825003, e=0),Worker(hostname=trantor09.umiacs.umd.edu
hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=68,
port=30068):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu
hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=9,
port=30009):(v=48825004, e=0),Worker(hostname=trantor24.umiacs.umd.edu
hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=95,
port=30095):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu
hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=143,
port=30143):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu
hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=133,
port=30133):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu
hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=60,
port=30060):(v=48824999, e=0),Worker(hostname=trantor15.umiacs.umd.edu
hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=129,
port=30129):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu
hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=177,
port=30177):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu
hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=41,
port=30041):(v=48825003, e=0),Worker(hostname=trantor04.umiacs.umd.edu
hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=198,
port=30198):(v=48824999, e=0),Worker(hostname=trantor12.umiacs.umd.edu
hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=193,
port=30193):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu
hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=145,
port=30145):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu
hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=66,
port=30066):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu
hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=49,
port=30049):(v=48825003, e=0),Worker(hostname=trantor07.umiacs.umd.edu
hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=28,
port=30028):(v=48825003, e=0),Worker(hostname=trantor11.umiacs.umd.edu
hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=4,
port=30004):(v=58590001, e=0),Worker(hostname=trantor20.umiacs.umd.edu
hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=26,
port=30026):(v=48825003, e=0),Worker(hostname=trantor05.umiacs.umd.edu
hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=181,
port=30181):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu
hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=55,
port=30055):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu
hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=96,
port=30096):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu
hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=17,
port=30017):(v=48825003, e=0),Worker(hostname=trantor10.umiacs.umd.edu
hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=36,
port=30036):(v=48825003, e=0),Worker(hostname=trantor13.umiacs.umd.edu
hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=176,
port=30176):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu
hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=108,
port=30108):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu
hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=77,
port=30077):(v=48824999, e=0),Worker(hostname=trantor15.umiacs.umd.edu
hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=126,
port=30126):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu
hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=136,
port=30136):(v=48824999, e=0),Worker(hostname=trantor04.umiacs.umd.edu
hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=197,
port=30197):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu
hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=90,
port=30090):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu
hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=29,
port=30029):(v=48825003, e=0),Worker(hostname=trantor12.umiacs.umd.edu
hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=192,
port=30192):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu
hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=50,
port=30050):(v=48825003, e=0),Worker(hostname=trantor11.umiacs.umd.edu
hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=3,
port=30003):(v=58590001, e=0),Worker(hostname=trantor08.umiacs.umd.edu
hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=56,
port=30056):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu
hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=97,
port=30097):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu
hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=18,
port=30018):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu
hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=127,
port=30127):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu
hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=35,
port=30035):(v=48825003, e=0),Worker(hostname=trantor00.umiacs.umd.edu
hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=78,
port=30078):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu
hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=175,
port=30175):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu
hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=82,
port=30082):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu
hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=107,
port=30107):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu
hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=74,
port=30074):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu
hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=47,
port=30047):(v=48825003, e=0),Worker(hostname=trantor04.umiacs.umd.edu
hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=1,
port=30001):(v=58589999, e=0),Worker(hostname=trantor23.umiacs.umd.edu
hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=115,
port=30115):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu
hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=38,
port=30038):(v=48825003, e=0),Worker(hostname=trantor16.umiacs.umd.edu
hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=110,
port=30110):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu
hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=15,
port=30015):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu
hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=124,
port=30124):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu
hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=6,
port=30006):(v=48825004, e=0),Worker(hostname=trantor12.umiacs.umd.edu
hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=191,
port=30191):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu
hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=182,
port=30182):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu
hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=174,
port=30174):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu
hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=98,
port=30098):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu
hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=164,
port=30164):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu
hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=65,
port=30065):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu
hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=24,
port=30024):(v=48825003, e=0),Worker(hostname=trantor00.umiacs.umd.edu
hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=79,
port=30079):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu
hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=73,
port=30073):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu
hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=138,
port=30138):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu
hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=81,
port=30081):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu
hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=5,
port=30005):(v=58590001, e=0),Worker(hostname=trantor12.umiacs.umd.edu
hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=190,
port=30190):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu
hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=27,
port=30027):(v=48825003, e=0),Worker(hostname=trantor10.umiacs.umd.edu
hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=37,
port=30037):(v=48825003, e=0),Worker(hostname=trantor04.umiacs.umd.edu
hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=199,
port=30199):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu
hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=146,
port=30146):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu
hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=57,
port=30057):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu
hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=48,
port=30048):(v=48825003, e=0),Worker(hostname=trantor05.umiacs.umd.edu
hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=183,
port=30183):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu
hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=173,
port=30173):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu
hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=25,
port=30025):(v=48825003, e=0),Worker(hostname=trantor01.umiacs.umd.edu
hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=64,
port=30064):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu
hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=16,
port=30016):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu
hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=125,
port=30125):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu
hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=109,
port=30109):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu
hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=163,
port=30163):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu
hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=72,
port=30072):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu
hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=137,
port=30137):(v=48824999, e=0),]
2016-11-08 11:54:21,098 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats:
Vertices - Mean: 49070351, Min:
Worker(hostname=trantor17.umiacs.umd.edu
hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=87, port=30087) -
48824999, Max: Worker(hostname=trantor11.umiacs.umd.edu
hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=5, port=30005) - 58590001
2016-11-08 11:54:21,098 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats:
Edges - Mean: 0, Min: Worker(hostname=trantor17.umiacs.umd.edu
hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=87, port=30087) - 0, Max:
Worker(hostname=trantor11.umiacs.umd.edu
hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=5, port=30005) - 0
2016-11-08 11:54:21,104 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 199 workers finished on superstep 1 on path
/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerFinishedDir
2016-11-08 11:54:29,090 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: setJobState:
{"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1}
on superstep 1
2016-11-08 11:54:29,094 INFO [ProcessThread(sid:0 cport:-1):]
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x15844b61ba30044
type:create cxid:0x1b zxid:0xd46 txntype:-1 reqpath:n/a Error
Path:/_hadoopBsp/job_1477020594559_0051/_masterJobState
Error:KeeperErrorCode = NodeExists for
/_hadoopBsp/job_1477020594559_0051/_masterJobState
2016-11-08 11:54:29,094 INFO [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: setJobState:
{"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1}
2016-11-08 11:54:29,096 INFO [ProcessThread(sid:0 cport:-1):]
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x15844b61ba3004a
type:create cxid:0x1b zxid:0xd47 txntype:-1 reqpath:n/a Error
Path:/_hadoopBsp/job_1477020594559_0051/_masterJobState
Error:KeeperErrorCode = NodeExists for
/_hadoopBsp/job_1477020594559_0051/_masterJobState
2016-11-08 11:54:29,096 INFO [ProcessThread(sid:0 cport:-1):]
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x15844b61ba3004f
type:create cxid:0x1b zxid:0xd48 txntype:-1 reqpath:n/a Error
Path:/_hadoopBsp/job_1477020594559_0051/_masterJobState
Error:KeeperErrorCode = NodeExists for
/_hadoopBsp/job_1477020594559_0051/_masterJobState
2016-11-08 11:54:29,096 INFO [ProcessThread(sid:0 cport:-1):]
org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
KeeperException when processing sessionid:0x15844b61ba30054
type:create cxid:0x1b zxid:0xd49 txntype:-1 reqpath:n/a Error
Path:/_hadoopBsp/job_1477020594559_0051/_masterJobState
Error:KeeperErrorCode = NodeExists for
/_hadoopBsp/job_1477020594559_0051/_masterJobState
2016-11-08 11:54:29,096 FATAL [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: failJob: Killing job
job_1477020594559_0051
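
(For reference, a minimal sketch of how such a timeout can be passed on
the command line; giraph.zkSessionMsecTimeout is an assumption for the
property name:

  hadoop jar <job jar> org.apache.giraph.GiraphRunner <computation class> \
    -ca giraph.zkSessionMsecTimeout=3600000 \
    <other options>
)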


Any other ideas?

Thanks,

BR,
Hai

On Tue, Nov 8, 2016 at 9:48 AM, Denis Dudinski <denis.dudinski@gmail.com>
wrote:

> Hi Hai,
>
> I think we saw something like this in our environment.
>
> The interesting row is this one:
> 2016-10-27 19:04:00,000 INFO [SessionTracker]
> org.apache.zookeeper.server.ZooKeeperServer: Expiring session
> 0x158084f5b2100b8, timeout of 600000ms exceeded
>
> I think that one of the workers did not communicate with ZooKeeper for
> quite a long time for some reason (it may be heavy network load or high
> CPU consumption; check your monitoring infrastructure, it should give
> you a hint). The ZooKeeper session expires and all ephemeral nodes for
> that worker in the ZooKeeper tree are deleted. The master then thinks
> the worker is dead and halts the computation.
>
> Your ZooKeeper session timeout is 600000 ms, which is 10 minutes. We
> set this to a much higher value (1 hour) and were able to perform
> computations successfully.
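>
> For example, the ceiling on the ZooKeeper server side is
> maxSessionTimeout in zoo.cfg (a minimal sketch; exact values depend on
> your cluster):
>
>   tickTime=2000
>   # upper bound on the session timeout a client may negotiate;
>   # defaults to 20 * tickTime when not set
>   maxSessionTimeout=3600000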
>
> I hope it will help in your case too.
>
> Best Regards,
> Denis Dudinski
>
> 2016-11-08 16:43 GMT+03:00 Hai Lan <lanhai1988@gmail.com>:
> > Hi Guys
> >
> > The OutOfMemoryError might be solved by adding
> > "-Dmapreduce.map.memory.mb=14848", but in my tests I found some more
> > problems while running the out-of-core graph.
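> >
> > (As a side note, a sketch of how I understand the container size and
> > the JVM heap fit together; mapreduce.map.java.opts is my assumption
> > for the heap flag:
> >
> >   # YARN container size for each map task, in MB
> >   -Dmapreduce.map.memory.mb=14848
> >   # JVM heap inside the container; keep it below the container size
> >   -Dmapreduce.map.java.opts=-Xmx13312m
> > )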
> >
> > I did two tests with a 150G, 10^10-vertex input on the 1.2 version,
> > and it seems it is not necessary to add settings like
> > "giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1"
> > because it is adaptive. However, if I run without setting
> > "userPartitionCount and maxPartitionsInMemory", the job will keep
> > running on superstep -1 forever. None of the workers can finish
> > superstep -1, and I can see a warning in the ZooKeeper log; not sure
> > if it is the problem:
> >
> > WARN [netty-client-worker-3] org.apache.giraph.comm.netty.handler.ResponseClientHandler:
> > exceptionCaught: Channel failed with remote address trantor21.umiacs.umd.edu/192.168.74.221:30172
> > java.lang.ArrayIndexOutOfBoundsException: 1075052544
> >       at org.apache.giraph.comm.flow_control.NoOpFlowControl.getAckSignalFlag(NoOpFlowControl.java:52)
> >       at org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:796)
> >       at org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87)
> >       at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> >       at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> >       at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
> >       at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> >       at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> >       at org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:74)
> >       at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> >       at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> >       at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
> >       at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126)
> >       at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
> >       at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
> >       at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
> >       at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
> >       at java.lang.Thread.run(Thread.java:745)
> >
> >
> >
> > If I add giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1,
> > the whole command is:
> >
> > hadoop jar /home/hlan/giraph-1.2.0-hadoop2/giraph-examples/target/giraph-examples-1.2.0-hadoop2-for-hadoop-2.6.0-jar-with-dependencies.jar \
> > org.apache.giraph.GiraphRunner -Dgiraph.useOutOfCoreGraph=true \
> > -Ddigraph.block_factory_configurators=org.apache.giraph.conf.FacebookConfiguration \
> > -Dmapreduce.map.memory.mb=14848 org.apache.giraph.examples.myTask \
> > -vif org.apache.giraph.examples.LongFloatNullTextInputFormat \
> > -vip /user/hlan/cube/tmp/out/ \
> > -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
> > -op /user/hlan/output -w 199 \
> > -ca mapred.job.tracker=localhost:5431,steps=6,giraph.isStaticGraph=true,giraph.numInputThreads=10,giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1
> >
> > the job passes superstep -1 very quickly (around 10 mins), but it
> > will be killed near the end of superstep 0.
> >
> > 2016-10-27 18:53:56,607 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Vertices - Mean: 9810049, Min: Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) - 9771533, Max: Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=49, port=30049) - 9995724
> > 2016-10-27 18:53:56,608 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Edges - Mean: 0, Min: Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) - 0, Max: Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=49, port=30049) - 0
> > 2016-10-27 18:53:56,634 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 18:54:26,638 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 18:54:56,640 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 18:55:26,641 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 18:55:56,642 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 18:56:26,643 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 18:56:56,644 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 18:57:26,645 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 18:57:56,646 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 18:58:26,647 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 18:58:56,675 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 18:59:26,676 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 18:59:56,677 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 19:00:26,678 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 19:00:56,679 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 19:01:26,680 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 19:01:29,610 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> > EndOfStreamException: Unable to read additional data from client sessionid 0x158084f5b2100c6, likely client has closed socket
> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> > at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> > at java.lang.Thread.run(Thread.java:745)
> > 2016-10-27 19:01:29,612 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.212:53136 which had sessionid 0x158084f5b2100c6
> > 2016-10-27 19:01:31,702 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /192.168.74.212:56696
> > 2016-10-27 19:01:31,711 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew session 0x158084f5b2100c6 at /192.168.74.212:56696
> > 2016-10-27 19:01:31,712 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.ZooKeeperServer: Established session 0x158084f5b2100c6 with negotiated timeout 600000 for client /192.168.74.212:56696
> > 2016-10-27 19:01:56,681 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 19:02:20,029 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> > EndOfStreamException: Unable to read additional data from client sessionid 0x158084f5b2100c5, likely client has closed socket
> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> > at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> > at java.lang.Thread.run(Thread.java:745)
> > 2016-10-27 19:02:20,030 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.212:53134 which had sessionid 0x158084f5b2100c5
> > 2016-10-27 19:02:21,584 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /192.168.74.212:56718
> > 2016-10-27 19:02:21,608 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew session 0x158084f5b2100c5 at /192.168.74.212:56718
> > 2016-10-27 19:02:21,608 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.ZooKeeperServer: Established session 0x158084f5b2100c5 with negotiated timeout 600000 for client /192.168.74.212:56718
> > 2016-10-27 19:02:26,682 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 19:02:56,683 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 19:03:05,743 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> > EndOfStreamException: Unable to read additional data from client sessionid 0x158084f5b2100b9, likely client has closed socket
> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> > at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> > at java.lang.Thread.run(Thread.java:745)
> > 2016-10-27 19:03:05,744 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.203:51130 which had sessionid 0x158084f5b2100b9
> > 2016-10-27 19:03:07,452 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /192.168.74.203:54676
> > 2016-10-27 19:03:07,493 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew session 0x158084f5b2100b9 at /192.168.74.203:54676
> > 2016-10-27 19:03:07,494 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.ZooKeeperServer: Established session 0x158084f5b2100b9 with negotiated timeout 600000 for client /192.168.74.203:54676
> > 2016-10-27 19:03:26,684 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 19:03:53,712 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> > EndOfStreamException: Unable to read additional data from client sessionid 0x158084f5b2100be, likely client has closed socket
> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> > at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> > at java.lang.Thread.run(Thread.java:745)
> > 2016-10-27 19:03:53,713 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.203:51146 which had sessionid 0x158084f5b2100be
> > 2016-10-27 19:03:55,436 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /192.168.74.203:54694
> > 2016-10-27 19:03:55,482 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew session 0x158084f5b2100be at /192.168.74.203:54694
> > 2016-10-27 19:03:55,483 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> > org.apache.zookeeper.server.ZooKeeperServer: Established session 0x158084f5b2100be with negotiated timeout 600000 for client /192.168.74.203:54694
> > 2016-10-27 19:03:56,719 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path
> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
> > 2016-10-27 19:04:00,000 INFO [SessionTracker]
> > org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x158084f5b2100b8, timeout of 600000ms exceeded
> > 2016-10-27 19:04:00,001 INFO [SessionTracker]
> > org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x158084f5b2100c2, timeout of 600000ms exceeded
> > 2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):]
> > org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x158084f5b2100b8
> > 2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):]
> > org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x158084f5b2100c2
> > 2016-10-27 19:04:00,004 INFO [SyncThread:0]
> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.203:51116 which had sessionid 0x158084f5b2100b8
> > 2016-10-27 19:04:00,006 INFO [SyncThread:0]
> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.212:53128 which had sessionid 0x158084f5b2100c2
> > 2016-10-27 19:04:00,033 INFO [org.apache.giraph.master.MasterThread]
> > org.apache.giraph.master.BspServiceMaster: setJobState: {"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1} on superstep 0
> >
> > Any idea about this?
> >
> > Thanks,
> >
> > Hai
> >
> >
> > On Tue, Nov 8, 2016 at 6:37 AM, Denis Dudinski
> > <denis.dudinski@gmail.com> wrote:
> >>
> >> Hi Xenia,
> >>
> >> Thank you! I'll check the thread you mentioned.
> >>
> >> Best Regards,
> >> Denis Dudinski
> >>
> >> 2016-11-08 14:16 GMT+03:00 Xenia Demetriou <xeniad20@gmail.com>:
> >> > Hi Denis,
> >> >
> >> > For the "java.lang.OutOfMemoryError: GC overhead limit exceeded"
> >> > error, I hope the conversation at the link below can help you:
> >> > www.mail-archive.com/user@giraph.apache.org/msg02938.html
> >> >
> >> > Regards,
> >> > Xenia
> >> >
> >> > 2016-11-08 12:25 GMT+02:00 Denis Dudinski <denis.dudinski@gmail.com>:
> >> >>
> >> >> Hi Hassan,
> >> >>
> >> >> Thank you for really quick response!
> >> >>
> >> >> I changed "giraph.isStaticGraph" to false and the error disappeared.
> >> >> As expected, the iteration became slow and edges were written to
> >> >> disk once again in superstep 1.
> >> >>
> >> >> However, the computation failed at superstep 2 with the error
> >> >> "java.lang.OutOfMemoryError: GC overhead limit exceeded". It seems
> >> >> to be unrelated to the "isStaticGraph" issue, but I think it is
> >> >> worth mentioning to see the picture as a whole.
> >> >>
> >> >> Are there any other tests I can run or information I can check to
> >> >> help pinpoint the "isStaticGraph" problem?
> >> >>
> >> >> Best Regards,
> >> >> Denis Dudinski
> >> >>
> >> >>
> >> >> 2016-11-07 20:00 GMT+03:00 Hassan Eslami <hsn.eslami@gmail.com>:
> >> >> > Hi Denis,
> >> >> >
> >> >> > Thanks for bringing up the issue. In a previous conversation
> >> >> > thread, a similar problem was reported even with a simpler example,
> >> >> > connected component calculation. Back then, though, we were still
> >> >> > developing other performance-critical components of OOC.
> >> >> >
> >> >> > Let's debug this issue together to make the new OOC more stable. I
> >> >> > suspect the problem is with "giraph.isStaticGraph=true" (as this is
> >> >> > only an optimization, and most of our end-to-end testing was on
> >> >> > cases where the graph could change). Let's get rid of it for now
> >> >> > and see if the problem still exists.
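> >> >> >
> >> >> > For example, in the launch command below that would mean dropping
> >> >> > the flag or flipping it explicitly:
> >> >> >
> >> >> >   -ca giraph.isStaticGraph=false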
> >> >> >
> >> >> > Best,
> >> >> > Hassan
> >> >> >
> >> >> > On Mon, Nov 7, 2016 at 6:24 AM, Denis Dudinski
> >> >> > <denis.dudinski@gmail.com> wrote:
> >> >> >>
> >> >> >> Hello,
> >> >> >>
> >> >> >> We are trying to calculate PageRank on a huge graph which does not
> >> >> >> fit into memory. For the calculation to succeed we tried to turn
> >> >> >> on the OutOfCore feature of Giraph, but every launch we tried
> >> >> >> resulted in com.esotericsoftware.kryo.KryoException: Buffer
> >> >> >> underflow. Each time it happens on different servers, but always
> >> >> >> right after the start of superstep 1.
> >> >> >>
> >> >> >> We are using Giraph 1.2.0 on Hadoop 2.7.3 (our prod version; we
> >> >> >> can't step back to Giraph's officially supported version and had
> >> >> >> to patch Giraph a little), placed on 11 servers + 3 master servers
> >> >> >> (namenodes etc.) with a separate ZooKeeper cluster deployment.
> >> >> >>
> >> >> >> Our launch command:
> >> >> >>
> >> >> >> hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar \
> >> >> >> org.apache.giraph.GiraphRunner com.prototype.di.pr.PageRankComputation \
> >> >> >> -mc com.prototype.di.pr.PageRankMasterCompute \
> >> >> >> -yj pr-job-jar-with-dependencies.jar \
> >> >> >> -vif com.belprime.di.pr.input.HBLongVertexInputFormat \
> >> >> >> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
> >> >> >> -op /user/hadoop/output/pr_test \
> >> >> >> -w 10 \
> >> >> >> -c com.prototype.di.pr.PRDoubleCombiner \
> >> >> >> -wc com.prototype.di.pr.PageRankWorkerContext \
> >> >> >> -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
> >> >> >> -ca giraph.logLevel=info \
> >> >> >> -ca hbase.mapreduce.inputtable=di_test \
> >> >> >> -ca hbase.mapreduce.scan.columns=di:n \
> >> >> >> -ca hbase.defaults.for.version.skip=true \
> >> >> >> -ca hbase.table.row.textkey=false \
> >> >> >> -ca giraph.yarn.task.heap.mb=48000 \
> >> >> >> -ca giraph.isStaticGraph=true \
> >> >> >> -ca giraph.SplitMasterWorker=false \
> >> >> >> -ca giraph.oneToAllMsgSending=true \
> >> >> >> -ca giraph.metrics.enable=true \
> >> >> >> -ca giraph.jmap.histo.enable=true \
> >> >> >> -ca giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable \
> >> >> >> -ca giraph.outgoingMessageValueClass=org.apache.hadoop.io.DoubleWritable \
> >> >> >> -ca giraph.inputOutEdgesClass=org.apache.giraph.edge.LongNullArrayEdges \
> >> >> >> -ca giraph.useOutOfCoreGraph=true \
> >> >> >> -ca giraph.waitForPerWorkerRequests=true \
> >> >> >> -ca giraph.maxNumberOfUnsentRequests=1000 \
> >> >> >> -ca giraph.vertexInputFilterClass=com.prototype.di.pr.input.PagesFromSameDomainLimiter \
> >> >> >> -ca giraph.useInputSplitLocality=true \
> >> >> >> -ca hbase.mapreduce.scan.cachedrows=10000 \
> >> >> >> -ca giraph.minPartitionsPerComputeThread=60 \
> >> >> >> -ca giraph.graphPartitionerFactoryClass=com.prototype.di.pr.DomainAwareGraphPartitionerFactory \
> >> >> >> -ca giraph.numInputThreads=1 \
> >> >> >> -ca giraph.inputSplitSamplePercent=20 \
> >> >> >> -ca giraph.pr.maxNeighborsPerVertex=50 \
> >> >> >> -ca giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition \
> >> >> >> -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \
> >> >> >> -ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions
> >> >> >>
> >> >> >> Log excerpt:
> >> >> >>
> >> >> >> 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in worker context
> >> >> >> 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60 partitions to process with 1 compute thread(s), originally 1 thread(s) on superstep 1
> >> >> >> 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60 partitions in memory and 1 active threads
> >> >> >> 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR computation
> >> >> >> 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.75
> >> >> >> 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> >> >> 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> >> >> 16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 937ms
> >> >> >> 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.72
> >> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.74
> >> >> >> 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> >> >> 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.76
> >> >> >> 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 234 is done!
> >> >> >> 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.79
> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 18
> >> >> >> 16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave received req MBytes = 0.0138, secs waited = 30.058
> >> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.82
> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StorePartitionIOCommand: (partitionId = 234)
> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StorePartitionIOCommand: (partitionId = 234) completed: bytes= 64419740, duration=351, bandwidth=175.03, bandwidth (excluding GC time)=175.03
> >> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StoreIncomingMessageIOCommand: (partitionId = 234)
> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StoreIncomingMessageIOCommand: (partitionId = 234) completed: bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC time)=NaN
> >> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
> >> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 3107ms
> >> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 15064ms
> >> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.71
> >> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
> >> >> >> JMap histo dump at Sun Nov 06 15:47:41 CET 2016
> >> >> >> 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 364 is done!
> >> >> >> 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> >> >> 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> >> >> -- num     #instances         #bytes  class name
> >> >> >> -- ----------------------------------------------
> >> >> >> --   1:     224004229    10752202992  java.util.concurrent.ConcurrentHashMap$Node
> >> >> >> --   2:      19751666     6645730528  [B
> >> >> >> --   3:     222135985     5331263640  com.belprime.di.pr.DomainPartAwareLongWritable
> >> >> >> --   4:     214686483     5152475592  org.apache.hadoop.io.DoubleWritable
> >> >> >> --   5:           353     4357261784  [Ljava.util.concurrent.ConcurrentHashMap$Node;
> >> >> >> --   6:        486266      204484688  [I
> >> >> >> --   7:       6017652      192564864  org.apache.giraph.utils.UnsafeByteArrayOutputStream
> >> >> >> --   8:       3986203      159448120  org.apache.giraph.utils.UnsafeByteArrayInputStream
> >> >> >> --   9:       2064182      148621104  org.apache.giraph.graph.ByteValueVertex
> >> >> >> --  10:       2064182       82567280  org.apache.giraph.edge.ByteArrayEdges
> >> >> >> --  11:       1886875       45285000  java.lang.Integer
> >> >> >> --  12:        349409       30747992  java.util.concurrent.ConcurrentHashMap$TreeNode
> >> >> >> --  13:        916970       29343040  java.util.Collections$1
> >> >> >> --  14:        916971       22007304  java.util.Collections$SingletonSet
> >> >> >> --  15:         47270        3781600  java.util.concurrent.ConcurrentHashMap$TreeBin
> >> >> >> --  16:         26201        2590912  [C
> >> >> >> --  17:         34175        1367000  org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
> >> >> >> --  18:          6143        1067704  java.lang.Class
> >> >> >> --  19:         25953         830496  java.lang.String
> >> >> >> --  20:         34175         820200  org.apache.giraph.edge.EdgeNoValue
> >> >> >> --  21:          4488         703400  [Ljava.lang.Object;
> >> >> >> --  22:            70         395424  [Ljava.nio.channels.SelectionKey;
> >> >> >> --  23:          2052         328320  java.lang.reflect.Method
> >> >> >> --  24:          6600         316800  org.apache.giraph.utils.ByteArrayVertexIdMessages
> >> >> >> --  25:          5781         277488  java.util.HashMap$Node
> >> >> >> --  26:          5651         271248  java.util.Hashtable$Entry
> >> >> >> --  27:          6604         211328  org.apache.giraph.factories.DefaultMessageValueFactory
> >> >> >> 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of callable failed
> >> >> >> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> >> >> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
> >> >> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
> >> >> >>     at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
> >> >> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> >> >>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >> >> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >> >> >>     at java.lang.Thread.run(Thread.java:745)
> >> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
> >> >> >>     at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> >> >> >>     at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
> >> >> >>     at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
> >> >> >>     at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
> >> >> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
> >> >> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> >> >> >>     at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
> >> >> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
> >> >> >>     at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
> >> >> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
> >> >> >>     ... 6 more
> >> >> >> 16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException: OverrideExceptionHandler on thread ooc-io-0, msg = call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!, exiting...
> >> >> >> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> >> >> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
> >> >> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
> >> >> >>     at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
> >> >> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> >> >>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >> >> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >> >> >>     at java.lang.Thread.run(Thread.java:745)
> >> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
> >> >> >>     at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> >> >> >>     at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
> >> >> >>     at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
> >> >> >>     at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
> >> >> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
> >> >> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> >> >> >>     at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
> >> >> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
> >> >> >>     at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
> >> >> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
> >> >> >>     ... 6 more
> >> >> >> 16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/giraph_yarn_application_1478342673283_0009/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir/datanode6.webmeup.com_5 on superstep 1
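The "Buffer underflow" raised in Input.require means the load path asked for
more bytes than the partition file written in superstep 1 still contained,
i.e. the store side (StorePartitionIOCommand) and the load side (readOutEdges)
disagree about the on-disk length of the edge data. A minimal, self-contained
sketch of that failure mode follows; it is a hypothetical standalone example
using the same Kryo classes that appear in the trace, not Giraph's actual
store/load code:

    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.KryoDataInput;
    import com.esotericsoftware.kryo.io.KryoDataOutput;
    import com.esotericsoftware.kryo.io.Output;
    import java.io.ByteArrayOutputStream;

    public class UnderflowDemo {
        public static void main(String[] args) throws Exception {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            Output out = new Output(buf);
            new KryoDataOutput(out).writeInt(42); // writer emits only 4 bytes
            out.flush();

            Input in = new Input(buf.toByteArray());
            // Reader expects a full 8-byte long -- the same call chain as
            // LongWritable.readFields -> KryoDataInput.readLong in the trace.
            // Only 4 bytes remain, so Input.require(8) throws
            // KryoException: Buffer underflow.
            new KryoDataInput(in).readLong();
        }
    }

Read this way, the exception is consistent with a write/read mismatch for the
configured vertex id and edge classes rather than with disk corruption.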
> >> >> >>
> >> >> >> We looked into one thread,
> >> >> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%3DoLJAAPQw%40mail.gmail.com%3E,
> >> >> >> but it is rather old, and at that time the answer was "do not use it yet"
> >> >> >> (see the reply at
> >> >> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%40mail.gmail.com%3E).
> >> >> >> Does that advice still hold today? We would like to use the new adaptive
> >> >> >> out-of-core approach if possible...
> >> >> >>
> >> >> >> Thank you in advance, any help or hint would be really appreciated.
> >> >> >>
> >> >> >> Best Regards,
> >> >> >> Denis Dudinski
