giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From José Luis Larroque <larroques...@gmail.com>
Subject Re: giraph.numInputThreads execution time for "input superstep" it's the same using 1 or 8 threads, how this can be possible?
Date Fri, 26 Aug 2016 11:52:15 GMT
Hi Hassan, thanks for your answer.

I don't know if is clear (because i'm using my own algorithm instead of
using one of the Giraph examples) but the input file (
 /user/hduser/input/grafo-wikipedia.txt) is loaded in HDFS. i was thinking
that with 8 input splits, giraph was partitioning the input file for
paralell proccesing, but i wasnt' sure.

I will try with only two threads in same worker.

Maybe using giraph.userPartitionCount will help? Or those partitions are
destinated to computation steps only?

Bye!
Jose



2016-08-26 3:11 GMT-03:00 Hassan Eslami <hsn.eslami@gmail.com>:

> It seems that you are reading the data from a single file stored on a
> local machine with multiple threads. Having multiple threads accessing the
> disk causes IO interference which in turn reduces the IO performance. If
> you are reading from a single file on a local machine with 8 threads, the
> results you've got is kind of expected. In such case, you are better off
> using single thread in reading from the disk. You can also try to do it
> with two threads, so that you may be able to get some overlapping benefit
> of reading from disk and deserializing the input.
>
> On Thu, Aug 25, 2016 at 5:36 PM, José Luis Larroque <
> larroquester@gmail.com> wrote:
>
>> he cluster used for this was 1 master and one slave, both of a r3.8xlarge
>> EC2 instance on AWS.
>>
>> 2016-08-25 19:26 GMT-03:00 José Luis Larroque <larroquester@gmail.com>:
>>
>>> I'm doing BFS search through the Wikipedia (spanish edition) site. I
>>> converted the [dump][1] into a file that could be read with Giraph.
>>>
>>> Using 1 worker, a file of 1 GB took 492 seconds. I executed Giraph with
>>> this command:
>>>
>>>     /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
>>> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
>>> -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
>>> -vip /user/hduser/input/grafo-wikipedia.txt -vof
>>> ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
>>> -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca
>>> giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true
>>>
>>> Container logs:
>>>
>>>     16/08/24 21:17:02 INFO master.BspServiceMaster:
>>> generateVertexInputSplits: Got 8 input splits for 1 input threads
>>>     16/08/24 21:17:02 INFO master.BspServiceMaster:
>>> createVertexInputSplits: Starting to write input split data to zookeeper
>>> with 1 threads
>>>     16/08/24 21:17:02 INFO master.BspServiceMaster:
>>> createVertexInputSplits: Done writing input split data to zookeeper
>>>     16/08/24 21:17:02 INFO yarn.GiraphYarnTask: [STATUS: task-0]
>>> MASTER_ZOOKEEPER_ONLY checkWorkers: Done - Found 1 responses of 1 needed to
>>> start superstep -1
>>>     16/08/24 21:17:02 INFO netty.NettyClient: Using Netty without
>>> authentication.
>>>     16/08/24 21:17:02 INFO netty.NettyClient: connectAllAddresses:
>>> Successfully added 1 connections, (1 total connected) 0 failed, 0 failures
>>> total.
>>>     16/08/24 21:17:02 INFO partition.PartitionUtils:
>>> computePartitionCount: Creating 1, default would have been 1 partitions.
>>>     ...
>>>     16/08/24 21:25:40 INFO netty.NettyClient: stop: Halting netty client
>>>     16/08/24 21:25:40 INFO netty.NettyClient: stop: reached wait
>>> threshold, 1 connections closed, releasing resources now.
>>>     16/08/24 21:25:43 INFO netty.NettyClient: stop: Netty client halted
>>>     16/08/24 21:25:43 INFO netty.NettyServer: stop: Halting netty server
>>>     16/08/24 21:25:43 INFO netty.NettyServer: stop: Start releasing
>>> resources
>>>     16/08/24 21:25:44 INFO bsp.BspService: process:
>>> cleanedUpChildrenChanged signaled
>>>     16/08/24 21:25:47 INFO netty.NettyServer: stop: Netty server halted
>>>     16/08/24 21:25:47 INFO bsp.BspService: process:
>>> masterElectionChildrenChanged signaled
>>>     16/08/24 21:25:47 INFO master.MasterThread: setup: Took 0.898
>>> seconds.
>>>     16/08/24 21:25:47 INFO master.MasterThread: input superstep: Took
>>> 452.531 seconds.
>>>     16/08/24 21:25:47 INFO master.MasterThread: superstep 0: Took 64.376
>>> seconds.
>>>     16/08/24 21:25:47 INFO master.MasterThread: superstep 1: Took 1.591
>>> seconds.
>>>     16/08/24 21:25:47 INFO master.MasterThread: shutdown: Took 6.609
>>> seconds.
>>>     16/08/24 21:25:47 INFO master.MasterThread: total: Took 526.006
>>> seconds.
>>>
>>> As you guys can see, the first line tell us that input superstep is
>>> executing with only **one** thread. And took 492 second in finish Input
>>> Superstep.
>>>
>>> I did another test, using giraph.numInputThreads=8, tryng to do the
>>> input superstep with 8 threads:
>>>
>>>     /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
>>> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
>>> -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
>>> -vip /user/hduser/input/grafo-wikipedia.txt -vof
>>> ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
>>> -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca
>>> giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,
>>> giraph.numInputThreads=8
>>>
>>> The result was the following one:
>>>
>>>         16/08/24 21:54:00 INFO master.BspServiceMaster:
>>> generateVertexInputSplits: Got 8 input splits for 8 input threads
>>>     16/08/24 21:54:00 INFO master.BspServiceMaster:
>>> createVertexInputSplits: Starting to write input split data to zookeeper
>>> with 1 threads
>>>     16/08/24 21:54:00 INFO master.BspServiceMaster:
>>> createVertexInputSplits: Done writing input split data to zookeeper
>>>     ...
>>>
>>>     16/08/24 22:10:07 INFO master.MasterThread: setup: Took 0.093
>>> seconds.
>>>     16/08/24 22:10:07 INFO master.MasterThread: input superstep: Took
>>> 891.339 seconds.
>>>     16/08/24 22:10:07 INFO master.MasterThread: superstep 0: Took 66.635
>>> seconds.
>>>     16/08/24 22:10:07 INFO master.MasterThread: superstep 1: Took 1.837
>>> seconds.
>>>     16/08/24 22:10:07 INFO master.MasterThread: shutdown: Took 6.605
>>> seconds.
>>>     16/08/24 22:10:07 INFO master.MasterThread: total: Took 966.512
>>> seconds.
>>>
>>>
>>> So, my question is, how can be possible that Giraph is using 492 seconds
>>> without input threads and 891 seconds with them? Should be exacly the
>>> opposite, right?
>>>
>>>
>>>   [1]: https://dumps.wikimedia.org/eswiki/20160601/ "dump"
>>>
>>
>>
>

Mime
View raw message