From: José Luis Larroque <larroquester@gmail.com>
To: user@giraph.apache.org
Date: Sat, 27 Aug 2016 21:33:42 -0300
Subject: Re: Giraph application get stuck, on superstep 4, all workers active but without progress

Using giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true didn't solve the problem.

I doubled the Netty threads and doubled the size of the Netty buffers, and saw no change.

I condensed the messages, roughly 1000 into 1, and got far fewer messages, but still the same final result.

Please help.

2016-08-26 21:24 GMT-03:00 José Luis Larroque <larroquester@gmail.com>:

> Hi again guys!
>
> I'm doing a BFS search through the Wikipedia (Spanish edition) site. I
> converted the dump (https://dumps.wikimedia.org/eswiki/20160601) into a
> file that Giraph can read.
>
> The BFS searches for paths, and everything is fine until it gets stuck at
> some point in superstep four.
>
> I'm using a cluster of 5 nodes (4 core slaves, 1 master) on AWS. Each node
> is an r3.8xlarge EC2 instance. The command for executing the BFS is this
> one:
>
> /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
>   ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
>   -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
>   -vip /user/hduser/input/grafo-wikipedia.txt
>   -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
>   -op /user/hduser/output/caminosNavegacionales
>   -w 4 -yh 120000
>   -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.logLevel=Debug
>
> Each container has almost 120 GB. I'm using a 1000M message limit in
> out-of-core, because I believed that was the problem, but apparently it is
> not.
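
(Side note on the message condensing mentioned at the top of this reply: in
Giraph, collapsing many messages bound for the same vertex into one is the
job of a MessageCombiner. What follows is only a minimal sketch of the idea,
assuming plain LongWritable ids and distance-valued messages; it is not the
message type this job actually uses, which carries paths.)

import org.apache.giraph.combiner.MessageCombiner;
import org.apache.hadoop.io.LongWritable;

/**
 * Sketch only: keeps the smallest distance proposed for each destination
 * vertex, so many incoming messages collapse into a single one.
 */
public class MinDistanceMessageCombiner
    implements MessageCombiner<LongWritable, LongWritable> {

  @Override
  public void combine(LongWritable vertexIndex, LongWritable originalMessage,
      LongWritable messageToCombine) {
    // Keep the minimum of the two candidate distances in originalMessage.
    if (messageToCombine.get() < originalMessage.get()) {
      originalMessage.set(messageToCombine.get());
    }
  }

  @Override
  public LongWritable createInitialMessage() {
    // Neutral element for min(): any real distance will replace it.
    return new LongWritable(Long.MAX_VALUE);
  }
}

(If I remember the 1.1 API correctly, a combiner like this is registered via
the giraph.messageCombinerClass option; it only helps when messages to the
same vertex can be merged meaningfully, which may not hold for path-valued
messages.)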
>
> These are the master logs (it seems the master is waiting for the workers
> to finish, but they just don't... and it stays like this forever):
>
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
> *16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000*
> *16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false*
> ...the same last two lines, about thirty times...
> ...
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>
> And in *all* the workers there is no information about what is happening
> (I'm testing this with *giraph.logLevel=Debug* because with Giraph's
> default log level I was lost), and the workers say this over and over
> again:
>
> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@7392f34d
> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82
>
> Before starting superstep 4, the information on each worker was the
> following:
>
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2] startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep: addressesAndPartitions[Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)]
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 2 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 3 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 4 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 5 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 6 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 7 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 8 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 9 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 10 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 11 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 12 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 13 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 14 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 15 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
> 16/08/26 00:43:08 DEBUG graph.GraphTaskManager: execute: Memory (free/total/max) = 92421.41M / 115000.00M / 115000.00M
>
> I don't know exactly what is failing:
> - I know that all containers have memory available; on the datanodes I
> checked that each one had about 50 GB free.
> - I'm not sure whether I'm hitting some limit in the use of out-of-core. I
> know that writing messages too fast is dangerous with Giraph 1.1, but if I
> hit that limit, I suppose the container would fail, right?
> - Maybe the ZooKeeper client connections aren't enough? I read that the
> ZooKeeper default of 60 for *maxClientCnxns* may be too small for a context
> like AWS, but I'm not familiar enough with the relationship between Giraph
> and ZooKeeper to start changing default configuration values.
> - Maybe I have to tune the out-of-core configuration, using
> giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true
> as someone recommended here
> (http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3CCC775449.2C4B%25majakabiljo@fb.com%3E)?
> - Should I tune the Netty configuration? I have the default configuration,
> but I believe that maybe 8 Netty client threads and 8 server threads would
> be enough, since I have only a few workers, and maybe too many Netty
> threads are creating the overhead that is making the whole application get
> stuck.
> - Using giraph.useBigDataIOForMessages=true didn't help either. I know
> that each vertex receives 100M or more messages and that property should
> help, but it didn't make any difference.
>
> As you may suspect, I have too many hypotheses; that's why I'm asking for
> help, so I can head in the right direction.
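
(For concreteness, the knobs from the list above are plain configuration
keys, so they can be passed with -ca as in the command earlier or set in
code when building the job. The sketch below is only that, a sketch: the
values are the guesses discussed in this thread, not verified fixes, the two
Netty thread keys are written from memory and worth double-checking against
GiraphConstants, and ZooKeeper's maxClientCnxns is a server-side setting in
zoo.cfg rather than a Giraph key.)

import org.apache.giraph.conf.GiraphConfiguration;

/** Sketch only: the tuning options discussed above, set programmatically. */
public class TuningSketch {
  public static GiraphConfiguration tunedConfiguration() {
    GiraphConfiguration conf = new GiraphConfiguration();
    // Out-of-core messaging, as already used by the job in this thread.
    conf.setBoolean("giraph.useOutOfCoreMessages", true);
    conf.setBoolean("giraph.useBigDataIOForMessages", true);
    // Flow control suggested in the 2012 thread linked above; the request
    // count here is an arbitrary placeholder, not a recommended value.
    conf.setInt("giraph.maxNumberOfOpenRequests", 1000);
    conf.setBoolean("giraph.waitForRequestsConfirmation", true);
    // Fewer Netty threads for a 4-worker job (hypothesis, not a fix).
    conf.setInt("giraph.nettyClientThreads", 8);
    conf.setInt("giraph.nettyServerThreads", 8);
    return conf;
  }
}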
>
> Any help would be greatly appreciated.
>
> Bye!
> Jose