giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alok Kumbhare <kumbh...@usc.edu>
Subject Re: Handling vertices with huge number of outgoing edges (and other optimization).
Date Sun, 29 Sep 2013 19:05:36 GMT
Thank you all for such detailed responses. I guess moving to YARN will help
us out with some of these issues. Is there a document that enumerates YARN
specific optimization/advantages and how to take advantage of those.

Thanks,
Alok


On Sun, Sep 29, 2013 at 11:43 AM, Eli Reisman <apache.mailbox@gmail.com>wrote:

> Actually, the data locality is a bit different in Giraph. What happens is
> that when running the non-YARN Giraph profiles, your workers are
> distributed by the Hadoop framework anywhere on the cluster, but once the
> Giraph workers launch, they attempt to claim an Input Split of the total
> input data (which has blocks spread all over the cluster presumably) using
> Apache Zookeeper.
>
> At this point, the worker must load whichever blocks the map to the input
> split the worker claimed. These could potentially be located anywhere on
> the cluster. At this point, the Giraph worker attempts to find some of the
> blocks it claimed on the node it happened to be started on. In this lucky
> situation, it will load those blocks locally. All other data for the input
> split will be pulled across the network from other nodes where the blocks
> reside.
>
> In practice, this form of data locality is more limited than MapReduce
> locality, but does reduce the input stage running time of Giraph jobs in
> many cases. In cases where there is not much input data or few job workers
> relative to the cluster size, this locality scheme is not very effective.
> One assumes the data input stage would be short due to the small scale of
> the job in these cases.
>
>
>
>
>
> On Wed, Sep 25, 2013 at 11:31 PM, Marco Aurelio Barbosa Fagnani Lotz <
> m.a.b.lotz@stu12.qmul.ac.uk> wrote:
>
>>  Hello Alok,
>>
>>  about the question 3.a, i guess the framework will indeed try to
>> allocate the local workers.
>> Each worker is actually a map only task. Due to the behaviour of the
>> Hadoop framework, it will aim for data locality. Therefore, the framework
>> will try to run the map tasks (and thus the workers) in nodes that have
>> local blocks.
>>
>>  Best regards,
>> Marco Aurelio Lotz
>>
>> Sent from my iPhone
>>
>> On 17 Sep 2013, at 18:19, "kumbhare@usc.edu" <kumbhare@usc.edu> wrote:
>>
>>   Hi,
>> We have a moderately sized data set (~20M vertices, ~45M edges).
>>
>>  We are running several basic algorithms e.g connected components,
>> single source shortest paths, page rank etc. The setup includes 12 nodes
>> with 8 core, 16GB ram each. We allow max three mappers per node (-Xmx5G)
>> and run upto 24 giraph workers for the experiments. We are using the trunk
>> version, pulled on 9/2 from the github on hadoop 1.2.1. We use HDFS data
>> store (the file is ~980 MB, with 64MB block size, we get around 15 HDFS
>> blocks)
>>
>>  Input data is in an adjacency list, json format. We use the built in
>> org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat as
>> the input format.
>>
>>  Given this setup, we have a few questions and appreciate any help to
>> optimize the execution:
>>
>>
>>    1. We observed that the dataset contains most of the vertices (>90%)
>>    with out degree < 20, and some have between 20-1000. However there a few
>>    vertices (<0.5%) with a very high out degree (>100,000).
>>     1. Due to this, most of the workers load data fairly quickly
>>       (~20-30secs), however a couple of workers take a much longer time
>>       (~800secs) to complete just the data input step. Is there a way to handle
>>       such vertices? Or do you suggest using any other input format?
>>        2. Another question we have is if, in general, there is a guide
>>    for choosing various optimization parameters?
>>       1. e.g. number of input/compute/output threads
>>        3. Data Locality and in memory messages:
>>     1. Is there any data locality attempt while running worker?
>>       Basically, out of 12 nodes, if the HDFS blocks for a file are stored only
>>       on say 8 nodes and I run 8 workers, is it guaranteed that the workers will
>>       run on those 8 nodes?
>>       2. Also, if the vertices are located on the same worker, do we
>>       have in memory message transfer between those vertices.
>>        4. Partitioning: We wish to study the effect of different
>>    partitioning schemes on the runtime.
>>     1. Is there a Partitioner we can use that will try to collocate
>>       neighboring vertices on the same worker while balancing different
>>       partitions? (Basically a METIS Partitioner)
>>       2. If we do pre-processing of the data file and store neighboring
>>       vertices close to each other in the file, implying different HDFS blocks
>>       will approximately contain neighboring vertices, and use the default giaph
>>       partitioner, will that help?
>>
>>
>>
>>  I know this is a long mail, and we truly appreciate your help.
>>
>>  Thanks,
>> Alok
>>
>>  Sent from Windows Mail
>>
>>
>

Mime
View raw message