From: José Luis Larroque
Date: Fri, 26 Aug 2016 21:24:28 -0300
Subject: Giraph application gets stuck on superstep 4, all workers active but without progress
To: user@giraph.apache.org

Hi again guys!

I'm doing a BFS search through the Wikipedia (Spanish edition) site. I converted the dump (https://dumps.wikimedia.org/eswiki/20160601) into a file that Giraph can read.

The BFS is searching for paths, and everything is fine until it gets stuck at some point of superstep four.

I'm using a cluster of 5 nodes (4 core slaves, 1 master) on AWS. Each node is an r3.8xlarge EC2 instance. The command for executing the BFS is this one:

    /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 4 -yh 120000 -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.logLevel=Debug

Each container has (almost) 120 GB. I'm using a 1000M message limit for out-of-core because I believed that was the problem, but apparently it is not.
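In case it helps to picture the workload: my real computation (BusquedaDeCaminosNavegacionalesWikiquote) keeps track of the navigational paths themselves, so the vertex value is more complex, but the shape of the compute() is basically a plain BFS like the sketch below (the types and the source page are simplified assumptions, not my actual code):

    import java.io.IOException;

    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;

    // Simplified BFS sketch, NOT the real job: the vertex value here is just the
    // BFS depth (-1 = not reached yet), and the source page is made up.
    public class SimpleBfsComputation
        extends BasicComputation<Text, LongWritable, NullWritable, LongWritable> {

      private static final Text SOURCE = new Text("Argentina"); // hypothetical source vertex

      @Override
      public void compute(Vertex<Text, LongWritable, NullWritable> vertex,
          Iterable<LongWritable> messages) throws IOException {
        if (getSuperstep() == 0) {
          boolean isSource = vertex.getId().equals(SOURCE);
          vertex.setValue(new LongWritable(isSource ? 0 : -1));
          if (isSource) {
            // The frontier starts at the source page.
            sendMessageToAllEdges(vertex, new LongWritable(1));
          }
        } else if (vertex.getValue().get() == -1) {
          // First time this page is reached: record the depth and expand the frontier.
          long depth = Long.MAX_VALUE;
          for (LongWritable message : messages) {
            depth = Math.min(depth, message.get());
          }
          if (depth != Long.MAX_VALUE) {
            vertex.setValue(new LongWritable(depth));
            sendMessageToAllEdges(vertex, new LongWritable(depth + 1));
          }
        }
        vertex.voteToHalt();
      }
    }

The point is that every newly reached page sends a message over all of its out-links, so around superstep 4 the frontier covers a large part of the graph and the message volume explodes (that's why some vertices receive 100M or more messages).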
These are the master logs (it seems the master is waiting for the workers to finish, but they just don't, and it stays like this forever):

    16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
    16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
    16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
    16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
    16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false

The last two PredicateLock lines repeat about thirty times, and then the whole block above appears again.

And in *all* the workers there is no information about what is happening (I'm testing this with giraph.logLevel=Debug because with the default Giraph log level I was lost); the workers just say this over and over again:

    16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@7392f34d
    16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82

Before starting superstep 4, the information on each worker was the following:

    16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2] startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep: addressesAndPartitions[Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)]
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 2 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 3 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 4 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 5 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 6 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 7 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 8 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 9 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 10 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 11 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 12 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 13 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 14 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
    16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 15 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
    16/08/26 00:43:08 DEBUG graph.GraphTaskManager: execute: Memory (free/total/max) = 92421.41M / 115000.00M / 115000.00M
I don't know what exactly is failing:

- I know that all containers have memory available; on the datanodes I checked that each one had around 50 GB free.
- I'm not sure whether I'm hitting some limit in the use of out-of-core. I know that writing messages too fast is dangerous with Giraph 1.1, but if I hit that limit I suppose the container would fail, right?
- Maybe the ZooKeeper client connections aren't enough? I read that the default value of 60 for maxClientCnxns in ZooKeeper may be too small in a context like AWS, but I'm not familiar enough with the relationship between Giraph and ZooKeeper to start changing default configuration values.
- Maybe I have to tune the out-of-core configuration, using giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true like someone recommended here (http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3CCC775449.2C4B%25majakabiljo@fb.com%3E)? (I put a sketch of the combined options after this list.)
- Should I tune the Netty configuration? I have the defaults, but I believe that 8 Netty client threads and 8 server threads would be enough, since I have only a few workers, and maybe too many Netty threads are creating the overhead that makes the entire application get stuck.
- Using giraph.useBigDataIOForMessages=true didn't help either. I know that each vertex is receiving 100M or more messages, so that property should be helpful, but it didn't make any difference.
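To make those last points concrete, this is a sketch of the combined settings I'm thinking of trying next. The values are guesses, and I'm not even sure the Netty property names are spelled giraph.nettyClientThreads / giraph.nettyServerThreads, so please correct me if they aren't:

    -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.isStaticGraph=true,giraph.maxMessagesInMemory=1000000000,giraph.useBigDataIOForMessages=true,giraph.maxNumberOfOpenRequests=1000,giraph.waitForRequestsConfirmation=true,giraph.nettyClientThreads=8,giraph.nettyServerThreads=8

    # and, for the ZooKeeper hypothesis, in zoo.cfg (60 is the default):
    maxClientCnxns=200

Does a combination like that make sense, or am I mixing options that work against each other?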
As you can probably tell, I have too many hypotheses; that's why I'm asking for help, so I can go in the right direction.

Any help would be greatly appreciated.

Bye!
Jose