From: José Luis Larroque <larroquester@gmail.com>
To: user@giraph.apache.org
Date: Sat, 27 Aug 2016 21:33:42 -0300
Subject: Re: Giraph application get stuck, on superstep 4, all workers active but without progress

Using giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true didn't solve the problem.

I doubled the Netty threads and doubled the size of the Netty buffers, and saw no change.

I condensed the messages, roughly 1000 into 1, and got far fewer messages, but still the same final result.

Please help.

2016-08-26 21:24 GMT-03:00 José Luis Larroque <larroquester@gmail.com>:

> Hi again guys!
>
> I'm doing a BFS search through the Wikipedia (Spanish edition) site. I
> converted the dump (https://dumps.wikimedia.org/eswiki/20160601) into a
> file that Giraph can read.
>
> The BFS searches for paths, and everything is fine until it gets stuck at
> some point in superstep four.
>
> I'm using a cluster of 5 nodes (4 core slaves, 1 master) on AWS. Each node
> is an r3.8xlarge EC2 instance. The command for executing the BFS is this
> one:
>
> /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
>   ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
>   -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
>   -vip /user/hduser/input/grafo-wikipedia.txt
>   -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
>   -op /user/hduser/output/caminosNavegacionales
>   -w 4 -yh 120000
>   -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.logLevel=Debug
>
> Each container has almost 120 GB. I'm using a 1000M message limit in
> out-of-core, because I believed that was the problem, but apparently it is
> not.
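
(Side note on the message condensing mentioned at the top of this reply: in
Giraph, collapsing many messages bound for the same vertex into one is the
job of a MessageCombiner. What follows is only a minimal sketch of the idea,
assuming plain LongWritable ids and distance-valued messages; it is not the
message type this job actually uses, which carries paths.)

import org.apache.giraph.combiner.MessageCombiner;
import org.apache.hadoop.io.LongWritable;

/**
 * Sketch only: keeps the smallest distance proposed for each destination
 * vertex, so many incoming messages collapse into a single one.
 */
public class MinDistanceMessageCombiner
    implements MessageCombiner<LongWritable, LongWritable> {

  @Override
  public void combine(LongWritable vertexIndex, LongWritable originalMessage,
      LongWritable messageToCombine) {
    // Keep the minimum of the two candidate distances in originalMessage.
    if (messageToCombine.get() < originalMessage.get()) {
      originalMessage.set(messageToCombine.get());
    }
  }

  @Override
  public LongWritable createInitialMessage() {
    // Neutral element for min(): any real distance will replace it.
    return new LongWritable(Long.MAX_VALUE);
  }
}

(If I remember the 1.1 API correctly, a combiner like this is registered via
the giraph.messageCombinerClass option; it only helps when messages to the
same vertex can be merged meaningfully, which may not hold for path-valued
messages.)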
>
> These are the master logs (it seems the master is waiting for the workers
> to finish, but they just don't... and it stays like this forever):
>
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
> *16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000*
> *16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false*
> ...the same last two lines, about thirty times...
> ...
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>
> And in *all* the workers there is no information about what is happening
> (I'm testing this with *giraph.logLevel=Debug* because with Giraph's
> default log level I was lost), and the workers say this over and over
> again:
>
> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@7392f34d
> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82
>
> Before starting superstep 4, the information on each worker was the
> following:
>
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2] startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep: addressesAndPartitions[Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)]
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 2 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 3 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 4 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 5 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 6 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 7 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 8 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 9 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 10 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 11 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 12 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 13 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 14 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 15 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
> 16/08/26 00:43:08 DEBUG graph.GraphTaskManager: execute: Memory (free/total/max) = 92421.41M / 115000.00M / 115000.00M
>
> I don't know exactly what is failing:
> - I know that all containers have memory available; on the datanodes I
> checked that each one had about 50 GB free.
> - I'm not sure whether I'm hitting some limit in the use of out-of-core. I
> know that writing messages too fast is dangerous with Giraph 1.1, but if I
> hit that limit, I suppose the container would fail, right?
> - Maybe the ZooKeeper client connections aren't enough? I read that the
> ZooKeeper default of 60 for *maxClientCnxns* may be too small for a context
> like AWS, but I'm not familiar enough with the relationship between Giraph
> and ZooKeeper to start changing default configuration values.
> - Maybe I have to tune the out-of-core configuration, using
> giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true
> as someone recommended here
> (http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3CCC775449.2C4B%25majakabiljo@fb.com%3E)?
> - Should I tune the Netty configuration? I have the default configuration,
> but I believe that maybe 8 Netty client threads and 8 server threads would
> be enough, since I have only a few workers, and maybe too many Netty
> threads are creating the overhead that is making the whole application get
> stuck.
> - Using giraph.useBigDataIOForMessages=true didn't help either. I know
> that each vertex receives 100M or more messages and that property should
> help, but it didn't make any difference.
>
> As you may suspect, I have too many hypotheses; that's why I'm asking for
> help, so I can head in the right direction.
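
(For concreteness, the knobs from the list above are plain configuration
keys, so they can be passed with -ca as in the command earlier or set in
code when building the job. The sketch below is only that, a sketch: the
values are the guesses discussed in this thread, not verified fixes, the two
Netty thread keys are written from memory and worth double-checking against
GiraphConstants, and ZooKeeper's maxClientCnxns is a server-side setting in
zoo.cfg rather than a Giraph key.)

import org.apache.giraph.conf.GiraphConfiguration;

/** Sketch only: the tuning options discussed above, set programmatically. */
public class TuningSketch {
  public static GiraphConfiguration tunedConfiguration() {
    GiraphConfiguration conf = new GiraphConfiguration();
    // Out-of-core messaging, as already used by the job in this thread.
    conf.setBoolean("giraph.useOutOfCoreMessages", true);
    conf.setBoolean("giraph.useBigDataIOForMessages", true);
    // Flow control suggested in the 2012 thread linked above; the request
    // count here is an arbitrary placeholder, not a recommended value.
    conf.setInt("giraph.maxNumberOfOpenRequests", 1000);
    conf.setBoolean("giraph.waitForRequestsConfirmation", true);
    // Fewer Netty threads for a 4-worker job (hypothesis, not a fix).
    conf.setInt("giraph.nettyClientThreads", 8);
    conf.setInt("giraph.nettyServerThreads", 8);
    return conf;
  }
}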
>
> Any help would be greatly appreciated.
>
> Bye!
> Jose