giraph-user mailing list archives

From José Luis Larroque <larroques...@gmail.com>
Subject Re: Giraph application get stuck, on superstep 4, all workers active but without progress
Date Sun, 28 Aug 2016 20:04:00 GMT
Problem solved. I optimized the processing of each message, and that fixed it.

Sorry for the spam guys :D

Bye!
Jose

2016-08-28 15:23 GMT-03:00 José Luis Larroque <larroquester@gmail.com>:

> OK, I understand what is happening now.
>
> I started using more compute threads, because I believed the problem was
> scalability. I started the application again, using:
> giraph.numComputeThreads=15 (r3.8xlarge has 32 cores)
> giraph.userPartitionCount=240 (4 per compute thread)
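>
> For reference, these go in as extra -ca options on the same launch command I
> posted further down in this thread, roughly like this (only the settings
> relevant here are shown):
>
>   ... -w 4 -yh 120000 -ca giraph.numComputeThreads=15,giraph.userPartitionCount=240,giraph.useOutOfCoreMessages=true,...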
>
> The application gets stuck on a single thread, and in a single partition. In
> that partition, I do a small amount of processing on each message: I have to
> append the vertex id to the end of each message, so that the result is
> available for that vertex's output.
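>
> To make that concrete: per message, what I do is essentially the sketch
> below. The Text-based types and the class name are simplified stand-ins for
> illustration, not my real IdTextWithComplexValue formats.
>
> import java.io.IOException;
>
> import org.apache.giraph.graph.BasicComputation;
> import org.apache.giraph.graph.Vertex;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
>
> // Simplified sketch: append the vertex id to every incoming message and
> // accumulate the result in the vertex value, so it ends up in the output.
> public class AppendVertexIdComputation
>     extends BasicComputation<Text, Text, NullWritable, Text> {
>
>   @Override
>   public void compute(Vertex<Text, Text, NullWritable> vertex,
>                       Iterable<Text> messages) throws IOException {
>     StringBuilder collected = new StringBuilder(vertex.getValue().toString());
>     for (Text message : messages) {
>       // This per-message string work is what a single overloaded partition
>       // spends all its time on while the rest of the cluster waits.
>       collected.append(message.toString())
>                .append("->")
>                .append(vertex.getId().toString())
>                .append(';');
>     }
>     vertex.setValue(new Text(collected.toString()));
>     vertex.voteToHalt();
>   }
> }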
>
> The problem is that this small per-message processing is taking too long, and
> I have the entire cluster waiting for it. I know there are other technologies
> for post-processing results; maybe I should use one of them?
>
> Bye!
> Jose
>
> 2016-08-27 21:33 GMT-03:00 José Luis Larroque <larroquester@gmail.com>:
>
>> Using giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true
>> didn't solve the problem.
>>
>> I doubled the Netty threads and doubled the original size of the Netty
>> buffers, and nothing changed.
>>
>> I condensed the messages, 1000 into 1, which gives far fewer messages, but
>> still the same final result.
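>>
>> In case it helps anyone reading this later: a Giraph message combiner is one
>> way to express that kind of condensing. A minimal sketch, assuming the 1.1
>> MessageCombiner interface and plain Text ids and messages (not my actual
>> classes), registered via giraph.messageCombinerClass, would be:
>>
>> import org.apache.giraph.combiner.MessageCombiner;
>> import org.apache.hadoop.io.Text;
>>
>> // Sketch: concatenate all messages headed to the same vertex into a single
>> // message, so many small messages travel (and are stored) as one.
>> public class ConcatenatingTextCombiner implements MessageCombiner<Text, Text> {
>>
>>   @Override
>>   public void combine(Text vertexIndex, Text originalMessage,
>>       Text messageToCombine) {
>>     // Fold the new message into the accumulated one, separated by '|'.
>>     if (originalMessage.getLength() == 0) {
>>       originalMessage.set(messageToCombine);
>>     } else {
>>       originalMessage.set(originalMessage.toString() + "|" + messageToCombine.toString());
>>     }
>>   }
>>
>>   @Override
>>   public Text createInitialMessage() {
>>     return new Text();
>>   }
>> }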
>>
>> Please, help.
>>
>> 2016-08-26 21:24 GMT-03:00 José Luis Larroque <larroquester@gmail.com>:
>>
>>> Hi again guys!
>>>
>>> I'm doing a BFS search through Wikipedia (Spanish edition). I converted the
>>> dump (https://dumps.wikimedia.org/eswiki/20160601/) into a file that can be
>>> read by Giraph.
>>>
>>> The BFS searches for paths, and everything is fine until it gets stuck at
>>> some point during superstep four.
>>>
>>> I'm using a cluster of 5 nodes (4 core slaves, 1 master) on AWS. Each node
>>> is an r3.8xlarge EC2 instance. The command for executing the BFS is this
>>> one:
>>> /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar \
>>>   ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote \
>>>   -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat \
>>>   -vip /user/hduser/input/grafo-wikipedia.txt \
>>>   -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat \
>>>   -op /user/hduser/output/caminosNavegacionales \
>>>   -w 4 -yh 120000 \
>>>   -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.logLevel=Debug
>>>
>>> Each container has (almost) 120 GB. I'm using a 1000M message limit for
>>> out-of-core, because I believed that was the problem, but apparently it is
>>> not.
>>>
>>> These are the master logs (it seems the master is waiting for the workers
>>> to finish, but they just don't... and it stays like this forever...):
>>>
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>>
>>> 16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
>>> 16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
>>> ...the same last two lines repeat thirty times...
>>> ...
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>>
>>> And in *all* workers there is no information about what is happening (I'm
>>> testing this with *giraph.logLevel=Debug* because with Giraph's default log
>>> level I was lost), and the workers say this over and over again:
>>>
>>> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@7392f34d
>>> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82
>>>
>>> Before starting superstep 4, the information on each worker was the
>>> following:
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2] startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep: addressesAndPartitions[Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)]
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0
>>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1
>>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 2
>>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 3
>>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 4
>>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 5
>>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 6
>>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 7
>>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 8
>>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 9
>>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 10
>>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 11
>>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 12
>>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 13
>>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 14
>>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 15
>>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
>>> 16/08/26 00:43:08 DEBUG graph.GraphTaskManager: execute: Memory
>>> (free/total/max) = 92421.41M / 115000.00M / 115000.00M
>>>
>>>
>>> I don't know what exactly is failing:
>>> - I know that all containers have memory available; on the datanodes I
>>> checked that each one had around 50 GB free.
>>> - I'm not sure whether I'm hitting some limit in the use of out-of-core. I
>>> know that writing messages too fast is dangerous with Giraph 1.1, but if I
>>> hit that limit, I suppose the container would fail, right?
>>> - Maybe there aren't enough ZooKeeper client connections? I read that the
>>> default value of 60 for ZooKeeper's *maxClientCnxns* may be too small in a
>>> context like AWS, but I don't understand the relationship between Giraph
>>> and ZooKeeper well enough to start changing default configuration values.
>>> - Maybe I have to tune the out-of-core configuration? Using
>>> giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true,
>>> as someone recommended here (http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3CCC775449.2C4B%25majakabiljo@fb.com%3E)?
>>> - Should I tune the Netty configuration? I have the default configuration,
>>> but I believe that 8 Netty client threads and 8 server threads might be
>>> enough, since I only have a few workers, and maybe too many Netty threads
>>> are creating the overhead that makes the entire application get stuck (see
>>> the configuration sketch after this list).
>>> - Using giraph.useBigDataIOForMessages=true didn't help either; I know that
>>> each vertex receives 100M or more messages, and that property should help,
>>> but it didn't make any difference.
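>>>
>>> For the out-of-core and Netty hypotheses above, this is roughly how I would
>>> append the options to the -ca list of the command I posted earlier; the
>>> concrete values are only guesses, and the thread-count property names
>>> (giraph.nettyClientThreads / giraph.nettyServerThreads) are from memory, so
>>> please correct me if they are wrong:
>>>
>>>   -ca giraph.maxNumberOfOpenRequests=1000,giraph.waitForRequestsConfirmation=true,giraph.nettyClientThreads=8,giraph.nettyServerThreads=8,giraph.useOutOfCoreMessages=true,...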
>>>
>>> As you may suspect, I have too many hypotheses; that's why I'm asking for
>>> help, so I can go in the right direction.
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Bye!
>>> Jose
>>>
>>>
>>>
>>>
>>>
>>
>
