giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hassan Eslami <hsn.esl...@gmail.com>
Subject Re: giraph.numInputThreads execution time for "input superstep" it's the same using 1 or 8 threads, how this can be possible?
Date Fri, 26 Aug 2016 18:58:29 GMT
In your case, it seems that the HDFS data node is the same as the worker.
So again, you will be having multiple threads reading multiple locations of
HDFS (which means multiple locations on a single disk). If this is the
case, as I mentioned in the last email, it causes IO interference.

giraph.userPartitionCount is not much relevant in input superstep (unless
you truly can get benefit from parallelism reading multiple input splits at
the same time, which reduces the lock contention on each partition while
reading).

On Fri, Aug 26, 2016 at 6:52 AM, José Luis Larroque <larroquester@gmail.com>
wrote:

> Hi Hassan, thanks for your answer.
>
> I don't know if is clear (because i'm using my own algorithm instead of
> using one of the Giraph examples) but the input file (
>  /user/hduser/input/grafo-wikipedia.txt) is loaded in HDFS. i was
> thinking that with 8 input splits, giraph was partitioning the input file
> for paralell proccesing, but i wasnt' sure.
>
> I will try with only two threads in same worker.
>
> Maybe using giraph.userPartitionCount will help? Or those partitions are
> destinated to computation steps only?
>
> Bye!
> Jose
>
>
>
> 2016-08-26 3:11 GMT-03:00 Hassan Eslami <hsn.eslami@gmail.com>:
>
>> It seems that you are reading the data from a single file stored on a
>> local machine with multiple threads. Having multiple threads accessing the
>> disk causes IO interference which in turn reduces the IO performance. If
>> you are reading from a single file on a local machine with 8 threads, the
>> results you've got is kind of expected. In such case, you are better off
>> using single thread in reading from the disk. You can also try to do it
>> with two threads, so that you may be able to get some overlapping benefit
>> of reading from disk and deserializing the input.
>>
>> On Thu, Aug 25, 2016 at 5:36 PM, José Luis Larroque <
>> larroquester@gmail.com> wrote:
>>
>>> he cluster used for this was 1 master and one slave, both of a
>>> r3.8xlarge EC2 instance on AWS.
>>>
>>> 2016-08-25 19:26 GMT-03:00 José Luis Larroque <larroquester@gmail.com>:
>>>
>>>> I'm doing BFS search through the Wikipedia (spanish edition) site. I
>>>> converted the [dump][1] into a file that could be read with Giraph.
>>>>
>>>> Using 1 worker, a file of 1 GB took 492 seconds. I executed Giraph with
>>>> this command:
>>>>
>>>>     /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
>>>> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
>>>> -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
>>>> -vip /user/hduser/input/grafo-wikipedia.txt -vof
>>>> ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
>>>> -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca
>>>> giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true
>>>>
>>>> Container logs:
>>>>
>>>>     16/08/24 21:17:02 INFO master.BspServiceMaster:
>>>> generateVertexInputSplits: Got 8 input splits for 1 input threads
>>>>     16/08/24 21:17:02 INFO master.BspServiceMaster:
>>>> createVertexInputSplits: Starting to write input split data to zookeeper
>>>> with 1 threads
>>>>     16/08/24 21:17:02 INFO master.BspServiceMaster:
>>>> createVertexInputSplits: Done writing input split data to zookeeper
>>>>     16/08/24 21:17:02 INFO yarn.GiraphYarnTask: [STATUS: task-0]
>>>> MASTER_ZOOKEEPER_ONLY checkWorkers: Done - Found 1 responses of 1 needed
to
>>>> start superstep -1
>>>>     16/08/24 21:17:02 INFO netty.NettyClient: Using Netty without
>>>> authentication.
>>>>     16/08/24 21:17:02 INFO netty.NettyClient: connectAllAddresses:
>>>> Successfully added 1 connections, (1 total connected) 0 failed, 0 failures
>>>> total.
>>>>     16/08/24 21:17:02 INFO partition.PartitionUtils:
>>>> computePartitionCount: Creating 1, default would have been 1 partitions.
>>>>     ...
>>>>     16/08/24 21:25:40 INFO netty.NettyClient: stop: Halting netty client
>>>>     16/08/24 21:25:40 INFO netty.NettyClient: stop: reached wait
>>>> threshold, 1 connections closed, releasing resources now.
>>>>     16/08/24 21:25:43 INFO netty.NettyClient: stop: Netty client halted
>>>>     16/08/24 21:25:43 INFO netty.NettyServer: stop: Halting netty server
>>>>     16/08/24 21:25:43 INFO netty.NettyServer: stop: Start releasing
>>>> resources
>>>>     16/08/24 21:25:44 INFO bsp.BspService: process:
>>>> cleanedUpChildrenChanged signaled
>>>>     16/08/24 21:25:47 INFO netty.NettyServer: stop: Netty server halted
>>>>     16/08/24 21:25:47 INFO bsp.BspService: process:
>>>> masterElectionChildrenChanged signaled
>>>>     16/08/24 21:25:47 INFO master.MasterThread: setup: Took 0.898
>>>> seconds.
>>>>     16/08/24 21:25:47 INFO master.MasterThread: input superstep: Took
>>>> 452.531 seconds.
>>>>     16/08/24 21:25:47 INFO master.MasterThread: superstep 0: Took
>>>> 64.376 seconds.
>>>>     16/08/24 21:25:47 INFO master.MasterThread: superstep 1: Took 1.591
>>>> seconds.
>>>>     16/08/24 21:25:47 INFO master.MasterThread: shutdown: Took 6.609
>>>> seconds.
>>>>     16/08/24 21:25:47 INFO master.MasterThread: total: Took 526.006
>>>> seconds.
>>>>
>>>> As you guys can see, the first line tell us that input superstep is
>>>> executing with only **one** thread. And took 492 second in finish Input
>>>> Superstep.
>>>>
>>>> I did another test, using giraph.numInputThreads=8, tryng to do the
>>>> input superstep with 8 threads:
>>>>
>>>>     /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
>>>> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
>>>> -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
>>>> -vip /user/hduser/input/grafo-wikipedia.txt -vof
>>>> ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
>>>> -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca
>>>> giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,
>>>> giraph.numInputThreads=8
>>>>
>>>> The result was the following one:
>>>>
>>>>         16/08/24 21:54:00 INFO master.BspServiceMaster:
>>>> generateVertexInputSplits: Got 8 input splits for 8 input threads
>>>>     16/08/24 21:54:00 INFO master.BspServiceMaster:
>>>> createVertexInputSplits: Starting to write input split data to zookeeper
>>>> with 1 threads
>>>>     16/08/24 21:54:00 INFO master.BspServiceMaster:
>>>> createVertexInputSplits: Done writing input split data to zookeeper
>>>>     ...
>>>>
>>>>     16/08/24 22:10:07 INFO master.MasterThread: setup: Took 0.093
>>>> seconds.
>>>>     16/08/24 22:10:07 INFO master.MasterThread: input superstep: Took
>>>> 891.339 seconds.
>>>>     16/08/24 22:10:07 INFO master.MasterThread: superstep 0: Took
>>>> 66.635 seconds.
>>>>     16/08/24 22:10:07 INFO master.MasterThread: superstep 1: Took 1.837
>>>> seconds.
>>>>     16/08/24 22:10:07 INFO master.MasterThread: shutdown: Took 6.605
>>>> seconds.
>>>>     16/08/24 22:10:07 INFO master.MasterThread: total: Took 966.512
>>>> seconds.
>>>>
>>>>
>>>> So, my question is, how can be possible that Giraph is using 492
>>>> seconds without input threads and 891 seconds with them? Should be exacly
>>>> the opposite, right?
>>>>
>>>>
>>>>   [1]: https://dumps.wikimedia.org/eswiki/20160601/ "dump"
>>>>
>>>
>>>
>>
>

Mime
View raw message