Yes. I think what Sergey meant by that is the OOC is capable of even spilling 90% of the graph to disk, just to give an example to show OOC is not limited to memory.

In your case, where you have a 1TB graph and 10TB of disk space, OOC would let the computation finish just fine. Although, be aware that the more data goes on disk, the more time is spent reading it back into memory. So, for instance, if you have a 1TB graph and 100GB of memory and you are running on a single machine, that means 90% of the graph is going on disk. If your computation per vertex is not too heavy (which is usually the case), the execution time will be bounded by disk operations. Let's say you are using a disk with 150MB/s bandwidth (a good HDD). In the example I mentioned, each superstep would need 900GB to be read and also written to the disk. That's roughly 3.5 hours per superstep.
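The back-of-envelope arithmetic above can be sketched as follows; the helper function is mine, not a Giraph API, and it assumes the disk-bound case where per-vertex compute is negligible:

```python
# Rough estimate of per-superstep OOC time when execution is disk-bound.
# Mirrors the example above: 1 TB graph, 100 GB memory, 150 MB/s disk.

def superstep_io_hours(graph_gb, memory_gb, disk_mb_per_s):
    spilled_gb = max(graph_gb - memory_gb, 0)   # data that must live on disk
    io_gb = 2 * spilled_gb                      # read back in AND written out
    seconds = io_gb * 1000 / disk_mb_per_s      # GB -> MB, then / bandwidth
    return seconds / 3600

print(round(superstep_io_hours(1000, 100, 150), 1))  # about 3.3 hours
```

This gives about 3.3 hours per superstep, in line with the "roughly 3.5 hours" figure above once overheads are accounted for.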

Best,
Hassan

On Wed, Nov 9, 2016 at 11:44 AM, Hai Lan <lanhai1988@gmail.com> wrote:
Hello Hassan
The 90% is mentioned by Sergey Edunov, who said "speaking of out of core, we tried to spill up to 90% of the graph to the disk.". So I guess it might mean OOC is still limited by memory size if the input graph size is over 10 times the memory size. By reading your response, just to double check: if the disk size is larger than the input graph, like a 1Tb graph and 10Tb of disk space, it should be able to run it, correct?

Thanks again

Best,

Hai

On Wed, Nov 9, 2016 at 12:33 PM, Hassan Eslami wrote:
Hi Hai,

1. One of the goals in having the adaptive mechanism was to make OOC faster than cases where you specify the number of partitions explicitly. In particular, if you don't know exactly what the number of partitions should be, you may end up setting it to a pessimistic number and not taking advantage of the entire available memory. That being said, the adaptive mechanism should always be preferred if you are aiming for higher performance. Also, the adaptive mechanism avoids OOM failures due to message overflow. That means the adaptive mechanism also provides higher robustness.

2. I don't understand where the 90% you are mentioning comes from. In my example in the other email, the 90% was for the suggested size of the tenured memory (to reduce GC overhead). The OOC mechanism works independently of how much memory is available. There are two fundamental limits for OOC though: a) OOC assumes one partition and its messages can fit entirely in memory. So, if partitions are large and any of them won't fit in memory, you should increase the number of partitions. b) OOC is limited to the "disk" size on each machine. If the amount of data on each machine exceeds the "disk" size, OOC will fail. In that case, you should use more machines or decrease your graph size somehow.
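The two limits above can be written down as a quick feasibility check; the function and its parameters are illustrative only, not part of the Giraph API:

```python
# Sketch of the two OOC feasibility limits described above:
# (a) one partition plus its messages must fit entirely in memory;
# (b) the data spilled per machine must fit on that machine's disk.
# All names and units (GB) here are my own, for illustration.

def ooc_feasible(partition_gb, msgs_per_partition_gb, memory_gb,
                 data_per_machine_gb, disk_gb):
    fits_in_memory = partition_gb + msgs_per_partition_gb <= memory_gb  # limit (a)
    fits_on_disk = data_per_machine_gb <= disk_gb                       # limit (b)
    return fits_in_memory and fits_on_disk
```

If limit (a) fails you increase the partition count; if limit (b) fails you add machines or shrink the graph, as described above.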

Best,
Hassan

On Wed, Nov 9, 2016 at 9:30 AM, Hai Lan wrote:
Many thanks Hassan

I did test a fixed number of partitions without isStaticGraph=true and it can work great.

I'll follow your instruction to test the adaptive mechanism then. But I have two small questions:

1. Is there any difference in performance between the fixed-number setting and the adaptive setting?

2. As I know, the out-of-core can only spill up to 90% of the input graph to disk. Does that mean, for example, a 10 Tb graph can be processed with at least 1 Tb of available memory?

Thanks again,

Best,
Hai

On Tue, Nov 8, 2016 at 12:42 PM, Hassan Eslami wrote:
Hi Hai,

I notice that you are trying to use the new OOC mechanism too. Here is my take on your issue:

As mentioned earlier in the thread, we noticed there is a bug with the "isStaticGraph=true" option. This is a flag only for optimization purposes. I'll create a JIRA and send a fix for it, but for now, please run your job without this flag. This should help you pass the first superstep.

As for the adaptive mechanism vs. a fixed number of partitions, both approaches are now acceptable in the new OOC design. If you add "giraph.maxPartitionsInMemory", the OOC infrastructure assumes that you are using a fixed number of partitions in memory and ignores any other OOC-related flags in your command. This is done to be backward compatible with existing code depending on OOC in the previous version. But be advised that using this type of out-of-core execution WILL NOT prevent your job from failures due to spikes in messages. Also, you have to make sure that the number you specify as the number of partitions in memory is set in a way that your specified number of partitions and their messages will fit in your available memory.
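For reference, a fixed-number-of-partitions run would pass the flags explicitly. The flags below come from this thread; the jar, computation class, and counts are placeholders you would size for your own job and memory:

```shell
# Fixed-partitions OOC: setting maxPartitionsInMemory makes the OOC
# infrastructure ignore the adaptive flags. Jar name, computation class,
# and the two counts are placeholders, not recommendations.
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
    MyComputation \
    -ca giraph.useOutOfCoreGraph=true \
    -ca giraph.userPartitionCount=1000 \
    -ca giraph.maxPartitionsInMemory=10
```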

On the other hand, I encourage you to use the adaptive mechanism, in which you do not have to mention the number of partitions in memory, and the OOC mechanism underneath will figure things out automatically. To use the adaptive mechanism, you should use the following flags:
giraph.useOutOfCoreGraph=true
giraph.waitForRequestsConfirmation=false
giraph.waitForPerWorkerRequests=true

I know the naming of the flags here is a bit bizarre, but this sets up the infrastructure for message flow control, which is crucial to avoid failures due to messages. The default strategy for the adaptive mechanism is threshold-based, meaning that there are a bunch of thresholds (default values for the thresholds are defined in the ThresholdBasedOracle class) and the system reacts to those. You should follow some (fairly easy) guidelines to set the proper thresholds for your system. Please refer to the other email response in the same thread for guidelines on how to set your thresholds properly.
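As a rough mental model of the threshold-based strategy (not the actual ThresholdBasedOracle code; the threshold values and function below are made up for illustration):

```python
# Illustrative threshold-based policy, loosely modeled on the idea
# behind ThresholdBasedOracle: compare current memory pressure against
# fixed thresholds and decide whether to spill or load partitions.
# These values and this function are mine, not Giraph's.

SPILL_THRESHOLD = 0.90   # above this fraction of heap used, spill to disk
LOAD_THRESHOLD = 0.50    # below this, it is safe to load data back

def ooc_action(used_fraction):
    if used_fraction > SPILL_THRESHOLD:
        return "spill"
    if used_fraction < LOAD_THRESHOLD:
        return "load"
    return "steady"
```

Tuning the thresholds for your system amounts to picking values like these that match your heap size and GC behavior.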

Hope it helps,
Best,
Hassan

On Tue, Nov 8, 2016 at 11:01 AM, Hai Lan <lanhai1988@gmail.com> wrote:
Hello Denis

I just tested setting the timeout to 3600000, and it seems like superstep 0 can be finished now. However, the job is killed immediately when superstep 1 starts. In the zookeeper log:

```2016-11-08 11:54:13,569 INFO ... org.apache.giraph.master.BspServiceMaster: checkWorkers: Only found 198 responses of 199 needed to start superstep 1. Reporting every 30000 msecs, 511036 more msecs left before giving up.
org.apache.giraph.master.BspServiceMaster: logMissingWorkersOnSuperstep: No response from partition 13 (could be master)
2016-11-08 11:54:13,571 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30000 type:create cxid:0x14e81 zxid:0xc76 ... _applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir
[... repeated KeeperException entries for _workerHealthyDir and _workerUnhealthyDir elided ...]
... added 0 connections, (0 total connected) 0 failed, 0 failures total.
org.apache.giraph.partition.PartitionBalancer: balancePartitionsAcrossWorkers: Using algorithm static
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: [Worker(hostname=trantor08.umiacs.umd.edu ..., MRtaskID=22, port=30022):(v=48825003, e=0), ... per-worker stats for all 199 workers elided ...]
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Vertices - Mean: 49070351, Min: Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=87, port=30087) - 48824999, Max: Worker(... MRtaskID=5, port=30005) - 58590001
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Edges - Mean: 0, Min: Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=87, port=30087) - 0, Max: Worker(hostname=trantor11.umia... - 0
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 1 on path /_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerFinishedDir
org.apache.giraph.master.BspServiceMaster: setJobState: {"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1} on superstep 1
2016-11-08 11:54:29,094 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30044 type:create cxid:0x1b zxid:0xd46 ... Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_masterJobState
[... three more identical NodeExists entries for _masterJobState elided ...]
org.apache.giraph.master.BspServiceMaster: failJob: Killing job job_1477020594559_0051```

Any other ideas?

Thanks,
BR,

Hai

On Tue, Nov 8, 2016 at 9:48 AM, Denis Dudinski wrote:
Hi Hai,

I think we saw something like this in our environment.

Interesting row is this one:
2016-10-27 19:04:00,000 INFO [SessionTracker]
org.apache.zookeeper.server.ZooKeeperServer: Expiring session
0x158084f5b2100b8, timeout of 600000ms exceeded

I think that one of the workers, for some reason, did not communicate with ZooKeeper for quite a long time (it may be heavy network load or
high CPU consumption; check your monitoring infrastructure, it should
give you a hint). The ZooKeeper session expires and all ephemeral nodes
for that worker in the ZooKeeper tree are deleted. The master thinks the
worker is dead and halts computation.

Your ZooKeeper session timeout is 600000 ms, which is 10 minutes. We
set this value to a much higher value, 1 hour, and were able
to perform computations successfully.
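On the command line this would be passed as a custom argument; I believe the option name is giraph.zkSessionMsecTimeout (from GiraphConstants), but please verify it against your Giraph version:

```shell
# Raise the ZooKeeper session timeout to 1 hour (3600000 ms).
# Option name assumed from GiraphConstants; check your Giraph version.
-ca giraph.zkSessionMsecTimeout=3600000
```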

I hope it will help in your case too.

Best Regards,
Denis Dudinski

2016-11-08 16:43 GMT+03:00 Hai Lan <lanhai1988@gmail.com>:
> Hi Guys
>
> The OutOfMemoryError might be solved by adding
> "-Dmapreduce.map.memory.mb=14848". But in my tests, I found some more
> problems while running out-of-core graph.
>
> I did two tests with 150G 10^10 vertices input in 1.2 version, and it seems
> like it is not necessary to add
> "giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1" because it is
> adaptive. However, if I run without setting "userPartitionCount and
> maxPartitionsInMemory", it will keep running on superstep -1
> forever. None of the workers can finish superstep -1. And I can see a warning in
> the zookeeper log, not sure if it is the problem:
>
> WARN [netty-client-worker-3]
> org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught:
> Channel failed with remote address
> trantor21.umiacs.umd.edu/192.168.74.221:30172
> java.lang.ArrayIndexOutOfBoundsException: 1075052544
>        at org.apache.giraph.comm.flow_control.NoOpFlowControl.getAckSignalFlag(NoOpFlowControl.java:52)
>        at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
>        [... intermediate netty handler frames elided ...]
>        at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
>        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
>        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
>        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
>
>
>
> If I add giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1,
> the whole command is:
>
> org.apache.giraph.GiraphRunner -Dgiraph.useOutOfCoreGraph=true
> -Ddigraph.block_factory_configurators=org.apache.giraph.conf.FacebookConfiguration
> org.apache.giraph.examples.LongFloatNullTextInputFormat -vip
> /user/hlan/cube/tmp/out/ -vof
> org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
> /user/hlan/output -w 199 -ca
>
> the job will pass superstep -1 very quickly (around 10 mins). But it will
> be killed near the end of superstep 0.
>
> 2016-10-27 18:53:56,607 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Vertices
> - Mean: 9810049, Min: Worker(hostname=trantor11.umiacs.umd.edu
> hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) - 9771533, Max:
> Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu,
> 2016-10-27 18:53:56,608 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Edges -
> Mean: 0, Min: Worker(hostname=trantor11.umiacs.umd.edu
> hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) - 0, Max:
> Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu,
> 2016-10-27 18:53:56,634 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> [... same barrierOnWorkerList message repeated every 30 s from 18:54:26
> through 19:01:26, still 0 out of 199 workers finished, elided ...]
> 2016-10-27 19:01:29,610 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> 0x158084f5b2100c6, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> 2016-10-27 19:01:29,612 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
> client /192.168.74.212:53136 which had sessionid 0x158084f5b2100c6
> 2016-10-27 19:01:31,702 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection
> from /192.168.74.212:56696
> 2016-10-27 19:01:31,711 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew
> session 0x158084f5b2100c6 at /192.168.74.212:56696
> 2016-10-27 19:01:31,712 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Established session
> 0x158084f5b2100c6 with negotiated timeout 600000 for client
> /192.168.74.212:56696
> 2016-10-27 19:01:56,681 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> 2016-10-27 19:02:20,029 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> 0x158084f5b2100c5, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> 2016-10-27 19:02:20,030 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
> client /192.168.74.212:53134 which had sessionid 0x158084f5b2100c5
> 2016-10-27 19:02:21,584 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection
> from /192.168.74.212:56718
> 2016-10-27 19:02:21,608 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew
> session 0x158084f5b2100c5 at /192.168.74.212:56718
> 2016-10-27 19:02:21,608 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Established session
> 0x158084f5b2100c5 with negotiated timeout 600000 for client
> /192.168.74.212:56718
> 2016-10-27 19:02:26,682 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> 2016-10-27 19:02:56,683 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> 2016-10-27 19:03:05,743 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> 0x158084f5b2100b9, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> 2016-10-27 19:03:05,744 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
> client /192.168.74.203:51130 which had sessionid 0x158084f5b2100b9
> 2016-10-27 19:03:07,452 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection
> from /192.168.74.203:54676
> 2016-10-27 19:03:07,493 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew
> session 0x158084f5b2100b9 at /192.168.74.203:54676
> 2016-10-27 19:03:07,494 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Established session
> 0x158084f5b2100b9 with negotiated timeout 600000 for client
> /192.168.74.203:54676
> 2016-10-27 19:03:26,684 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> 2016-10-27 19:03:53,712 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> 0x158084f5b2100be, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> 2016-10-27 19:03:53,713 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
> client /192.168.74.203:51146 which had sessionid 0x158084f5b2100be
> 2016-10-27 19:03:55,436 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection
> from /192.168.74.203:54694
> 2016-10-27 19:03:55,482 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew
> session 0x158084f5b2100be at /192.168.74.203:54694
> 2016-10-27 19:03:55,483 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181]
> org.apache.zookeeper.server.ZooKeeperServer: Established session
> 0x158084f5b2100be with negotiated timeout 600000 for client
> /192.168.74.203:54694
> 2016-10-27 19:03:56,719 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199
> workers finished on superstep 0 on path
> 2016-10-27 19:04:00,000 INFO [SessionTracker]
> org.apache.zookeeper.server.ZooKeeperServer: Expiring session
> 0x158084f5b2100b8, timeout of 600000ms exceeded
> 2016-10-27 19:04:00,001 INFO [SessionTracker]
> org.apache.zookeeper.server.ZooKeeperServer: Expiring session
> 0x158084f5b2100c2, timeout of 600000ms exceeded
> 2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):]
> org.apache.zookeeper.server.PrepRequestProcessor: Processed session
> termination for sessionid: 0x158084f5b2100b8
> 2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):]
> org.apache.zookeeper.server.PrepRequestProcessor: Processed session
> termination for sessionid: 0x158084f5b2100c2
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
> client /192.168.74.203:51116 which had sessionid 0x158084f5b2100b8
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for
> client /192.168.74.212:53128 which had sessionid 0x158084f5b2100c2
> 2016-10-27 19:04:00,033 INFO [org.apache.giraph.master.MasterThread]
> org.apache.giraph.master.BspServiceMaster: setJobState:
> {"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1} on
> superstep 0
>
>
> Thanks,
>
> Hai
>
>
> On Tue, Nov 8, 2016 at 6:37 AM, Denis Dudinski <denis.dudinski@gmail.com>
> wrote:
>>
>> Hi Xenia,
>>
>> Thank you! I'll check the thread you mentioned.
>>
>> Best Regards,
>> Denis Dudinski
>>
>> 2016-11-08 14:16 GMT+03:00 Xenia Demetriou <xeniad20@gmail.com>:
>> > Hi Denis,
>> >
>> > For the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error,
>> > I hope the conversation in the link below can help you.
>> > www.mail-archive.com/user@giraph.apache.org/msg02938.html
>> >
>> > Regards,
>> > Xenia
>> >
>> > 2016-11-08 12:25 GMT+02:00 Denis Dudinski <denis.dudinski@gmail.com>:
>> >>
>> >> Hi Hassan,
>> >>
>> >> Thank you for the really quick response!
>> >>
>> >> I changed "giraph.isStaticGraph" to false and the error disappeared.
>> >> As expected, the iteration became slower and wrote edges to disk once
>> >> again in superstep 1.
>> >>
>> >> However, the computation failed at superstep 2 with the error
>> >> "java.lang.OutOfMemoryError: GC overhead limit exceeded". It seems to be
>> >> unrelated to the "isStaticGraph" issue, but I think it is worth
>> >> mentioning to see the picture as a whole.
>> >>
>> >> Are there any other tests/information I can execute/check to help
>> >> pinpoint the "isStaticGraph" problem?
>> >>
>> >> Best Regards,
>> >> Denis Dudinski
>> >>
>> >>
>> >> 2016-11-07 20:00 GMT+03:00 Hassan Eslami <hsn.eslami@gmail.com>:
>> >> > Hi Denis,
>> >> >
>> >> > Thanks for bringing up the issue. In the previous conversation a
>> >> > similar problem was reported even with a simpler example, a connected
>> >> > component calculation. Back then, though, we were still developing
>> >> > other performance-critical components of OOC.
>> >> >
>> >> > Let's debug this issue together to make the new OOC more stable. I
>> >> > suspect the problem is with "giraph.isStaticGraph=true" (as this is
>> >> > only an optimization, and most of our end-to-end testing was on cases
>> >> > where the graph could change). Let's get rid of it for now and see if
>> >> > the problem still exists.
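Hassan's suggestion amounts to changing a single flag in the launch command quoted later in this thread. A minimal sketch of the change, with all other flags elided (the jar path and class names are taken from Denis's original command):

```shell
# Sketch only: every flag other than the one under test is elided.
# Before: -ca giraph.isStaticGraph=true
# After:  set it to false (or simply omit the flag) and re-run.
hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar \
    org.apache.giraph.GiraphRunner com.prototype.di.pr.PageRankComputation \
    -ca giraph.useOutOfCoreGraph=true \
    -ca giraph.isStaticGraph=false
```

With the optimization off, Giraph re-serializes edges on every superstep instead of reusing the copy written during input, which is slower but exercises the code path the OOC team tested most.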
>> >> >
>> >> > Best,
>> >> > Hassan
>> >> >
>> >> > On Mon, Nov 7, 2016 at 6:24 AM, Denis Dudinski
>> >> > <denis.dudinski@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Hello,
>> >> >>
>> >> >> We are trying to calculate PageRank on a huge graph, which does not
>> >> >> fit into memory. For the calculation to succeed we tried to turn on
>> >> >> the OutOfCore feature of Giraph, but every launch we tried resulted
>> >> >> in "com.esotericsoftware.kryo.KryoException: Buffer underflow". Each
>> >> >> time it happens on different servers, but always right after the
>> >> >> start of superstep 1.
>> >> >>
>> >> >> We are using Giraph 1.2.0 on Hadoop 2.7.3 (our prod version; we can't
>> >> >> step back to Giraph's officially supported version and had to patch
>> >> >> Giraph a little) on 11 servers + 3 master servers (namenodes etc.)
>> >> >> with a separate ZooKeeper cluster deployment.
>> >> >>
>> >> >> Our launch command:
>> >> >>
>> >> >> hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar
>> >> >> org.apache.giraph.GiraphRunner
>> >> >> com.prototype.di.pr.PageRankComputation \
>> >> >> -mc com.prototype.di.pr.PageRankMasterCompute \
>> >> >> -yj pr-job-jar-with-dependencies.jar \
>> >> >> -vif com.belprime.di.pr.input.HBLongVertexInputFormat \
>> >> >> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
>> >> >> -op /user/hadoop/output/pr_test \
>> >> >> -w 10 \
>> >> >> -c com.prototype.di.pr.PRDoubleCombiner \
>> >> >> -wc com.prototype.di.pr.PageRankWorkerContext \
>> >> >> -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
>> >> >> -ca giraph.logLevel=info \
>> >> >> -ca hbase.mapreduce.inputtable=di_test \
>> >> >> -ca hbase.mapreduce.scan.columns=di:n \
>> >> >> -ca hbase.defaults.for.version.skip=true \
>> >> >> -ca hbase.table.row.textkey=false \
>> >> >> -ca giraph.yarn.task.heap.mb=48000 \
>> >> >> -ca giraph.isStaticGraph=true \
>> >> >> -ca giraph.SplitMasterWorker=false \
>> >> >> -ca giraph.oneToAllMsgSending=true \
>> >> >> -ca giraph.metrics.enable=true \
>> >> >> -ca giraph.jmap.histo.enable=true \
>> >> >> -ca giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable \
>> >> >> -ca \
>> >> >> -ca giraph.inputOutEdgesClass=org.apache.giraph.edge.LongNullArrayEdges \
>> >> >> -ca giraph.useOutOfCoreGraph=true \
>> >> >> -ca giraph.waitForPerWorkerRequests=true \
>> >> >> -ca giraph.maxNumberOfUnsentRequests=1000 \
>> >> >> -ca giraph.vertexInputFilterClass=com.prototype.di.pr.input.PagesFromSameDomainLimiter \
>> >> >> -ca giraph.useInputSplitLocality=true \
>> >> >> -ca hbase.mapreduce.scan.cachedrows=10000 \
>> >> >> -ca giraph.minPartitionsPerComputeThread=60 \
>> >> >> -ca giraph.graphPartitionerFactoryClass=com.prototype.di.pr.DomainAwareGraphPartitionerFactory \
>> >> >> -ca giraph.numInputThreads=1 \
>> >> >> -ca giraph.inputSplitSamplePercent=20 \
>> >> >> -ca giraph.pr.maxNeighborsPerVertex=50 \
>> >> >> -ca giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition \
>> >> >> -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \
>> >> >> -ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions
>> >> >>
>> >> >> Logs excerpt:
>> >> >>
>> >> >> 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in
>> >> >> worker context
>> >> >> 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60 partitions
>> >> >> to process with 1 compute thread(s), originally 1 thread(s) on
>> >> >> superstep 1
>> >> >> 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60
>> >> >> partitions in memory and 1 active threads
>> >> >> 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR
>> >> >> computation
>> >> >> 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle: getNextIOActions:
>> >> >> usedMemoryFraction = 0.75
>> >> >> 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine:
>> >> >> to
>> >> >> 1
>> >> >> 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle:
>> >> >> updateRequestsCredit: updating the credit to 20
>> >> >> 16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring:
>> >> >> name = PS Scavenge, action = end of minor GC, cause = Allocation
>> >> >> Failure, duration = 937ms
>> >> >> 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle: getNextIOActions:
>> >> >> usedMemoryFraction = 0.72
>> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: getNextIOActions:
>> >> >> usedMemoryFraction = 0.74
>> >> >> 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine:
>> >> >> to
>> >> >> 1
>> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle:
>> >> >> updateRequestsCredit: updating the credit to 20
>> >> >> 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle: getNextIOActions:
>> >> >> usedMemoryFraction = 0.76
>> >> >> 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition:
>> >> >> processing partition 234 is done!
>> >> >> 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle: getNextIOActions:
>> >> >> usedMemoryFraction = 0.79
>> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine:
>> >> >> to
>> >> >> 1
>> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle:
>> >> >> updateRequestsCredit: updating the credit to 18
>> >> >> 16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window
>> >> >> metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave
>> >> >> received req MBytes = 0.0138, secs waited = 30.058
>> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions:
>> >> >> usedMemoryFraction = 0.82
>> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next
>> >> >> IO command is: StorePartitionIOCommand: (partitionId = 234)
>> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's
>> >> >> command StorePartitionIOCommand: (partitionId = 234) completed:
>> >> >> bytes= 64419740, duration=351, bandwidth=175.03, bandwidth (excluding
>> >> >> GC time)=175.03
>> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions:
>> >> >> usedMemoryFraction = 0.83
>> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next
>> >> >> IO command is: StoreIncomingMessageIOCommand: (partitionId = 234)
>> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's
>> >> >> command StoreIncomingMessageIOCommand: (partitionId = 234) completed:
>> >> >> bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC time)=NaN
>> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions:
>> >> >> usedMemoryFraction = 0.83
>> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring:
>> >> >> name = PS Scavenge, action = end of minor GC, cause = Allocation
>> >> >> Failure, duration = 3107ms
>> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring:
>> >> >> name = PS MarkSweep, action = end of major GC, cause = Ergonomics,
>> >> >> duration = 15064ms
>> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine:
>> >> >> to
>> >> >> 1
>> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle:
>> >> >> updateRequestsCredit: updating the credit to 20
>> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: getNextIOActions:
>> >> >> usedMemoryFraction = 0.71
>> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's next
>> >> >> IO command is: LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
>> >> >> JMap histo dump at Sun Nov 06 15:47:41 CET 2016
>> >> >> 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition:
>> >> >> processing partition 364 is done!
>> >> >> 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine:
>> >> >> to
>> >> >> 1
>> >> >> 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle:
>> >> >> updateRequestsCredit: updating the credit to 20
>> >> >> --
>> >> >> -- num      #instances          #bytes  class name
>> >> >> -- ----------------------------------------------
>> >> >> --   1:      224004229     10752202992  java.util.concurrent.ConcurrentHashMap$Node
>> >> >> --   2:       19751666      6645730528  [B
>> >> >> --   3:      222135985      5331263640  com.belprime.di.pr.DomainPartAwareLongWritable
>> >> >> --   4:      214686483      5152475592
>> >> >> --   5:            353      4357261784  [Ljava.util.concurrent.ConcurrentHashMap$Node;
>> >> >> --   6:         486266       204484688  [I
>> >> >> --   7:        6017652       192564864  org.apache.giraph.utils.UnsafeByteArrayOutputStream
>> >> >> --   8:        3986203       159448120  org.apache.giraph.utils.UnsafeByteArrayInputStream
>> >> >> --   9:        2064182       148621104  org.apache.giraph.graph.ByteValueVertex
>> >> >> --  10:        2064182        82567280  org.apache.giraph.edge.ByteArrayEdges
>> >> >> --  11:        1886875        45285000  java.lang.Integer
>> >> >> --  12:         349409        30747992  java.util.concurrent.ConcurrentHashMap$TreeNode
>> >> >> --  13:         916970        29343040  java.util.Collections$1
>> >> >> --  14:         916971        22007304  java.util.Collections$SingletonSet
>> >> >> --  15:          47270         3781600  java.util.concurrent.ConcurrentHashMap$TreeBin
>> >> >> --  16:          26201         2590912  [C
>> >> >> --  17:          34175         1367000  org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
>> >> >> --  18:           6143         1067704  java.lang.Class
>> >> >> --  19:          25953          830496  java.lang.String
>> >> >> --  20:          34175          820200  org.apache.giraph.edge.EdgeNoValue
>> >> >> --  21:           4488          703400  [Ljava.lang.Object;
>> >> >> --  22:             70          395424  [Ljava.nio.channels.SelectionKey;
>> >> >> --  23:           2052          328320  java.lang.reflect.Method
>> >> >> --  24:           6600          316800  org.apache.giraph.utils.ByteArrayVertexIdMessages
>> >> >> --  25:           5781          277488  java.util.HashMap$Node
>> >> >> --  26:           5651          271248  java.util.Hashtable$Entry
>> >> >> --  27:           6604          211328  org.apache.giraph.factories.DefaultMessageValueFactory
>> >> >> 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of
>> >> >> callable failed
>> >> >> java.lang.RuntimeException: call: execution of IO command
>> >> >> LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
>> >> >> at
>> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
>> >> >> at
>> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
>> >> >> at
>> >> >> org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
>> >> >> at
>> >> >> at
>> >> >> at java.lang.Thread.run(Thread.java:745)
>> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
>> >> >> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
>> >> >> at
>> >> >> at
>> >> >> at
>> >> >> at
>> >> >> at
>> >> >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
>> >> >> at
>> >> >> at
>> >> >> at
>> >> >> at
>> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
>> >> >> ... 6 more
>> >> >> 16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException:
>> >> >> OverrideExceptionHandler on thread ooc-io-0, msg = call: execution of
>> >> >> IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
>> >> >> failed!, exiting...
>> >> >> java.lang.RuntimeException: call: execution of IO command
>> >> >> LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
>> >> >> at
>> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
>> >> >> at
>> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
>> >> >> at
>> >> >> org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
>> >> >> at
>> >> >> at
>> >> >> at java.lang.Thread.run(Thread.java:745)
>> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
>> >> >> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
>> >> >> at
>> >> >> at
>> >> >> at
>> >> >> at
>> >> >> at
>> >> >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
>> >> >> at
>> >> >> at
>> >> >> at
>> >> >> at
>> >> >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
>> >> >> ... 6 more
>> >> >> 16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth: Got
>> >> >> failure, unregistering health on
>> >> >> /_hadoopBsp/giraph_yarn_application_1478342673283_0009/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir/datanode6.webmeup.com_5
>> >> >> on superstep 1
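As a sanity check on the OOC figures in the log above (the StorePartitionIOCommand line reports bytes= 64419740, duration=351 and bandwidth=175.03), the reported bandwidth is reproduced by treating the duration as milliseconds and the "MB" as a binary megabyte (2**20 bytes); this is an inference from the numbers, not a statement from the Giraph source:

```python
# Reproduce the bandwidth Giraph logged for the StorePartitionIOCommand:
# "bytes= 64419740, duration=351, bandwidth=175.03" (duration in ms).

def ooc_bandwidth_mb_per_s(num_bytes: int, duration_ms: int) -> float:
    """Bandwidth in MiB/s (binary megabytes per second)."""
    return (num_bytes / 2**20) / (duration_ms / 1000.0)

bw = ooc_bandwidth_mb_per_s(64419740, 351)
print(round(bw, 2))  # -> 175.03
```

The ~175 MiB/s figure is consistent with the partitions directory spanning two local disks, which is faster than a single commodity HDD would sustain.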
>> >> >>
>> >> >> We looked into one thread,
>> >> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%3DoLJAAPQw%40mail.gmail.com%3E,
>> >> >> but it is rather old, and at that time the answer was "do not use it
>> >> >> yet"
>> >> >> (http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%40mail.gmail.com%3E).
>> >> >> Does that still hold today? We would like to use the new advanced
>> >> >> adaptive OOC approach if possible...
>> >> >>
>> >> >> Thank you in advance; any help or hint would be really appreciated.
>> >> >>
>> >> >> Best Regards,
>> >> >> Denis Dudinski
>> >> >
>> >> >
>> >
>> >
>
>
