giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marco Aurelio Barbosa Fagnani Lotz <m.a.b.l...@stu12.qmul.ac.uk>
Subject Re: Handling vertices with huge number of outgoing edges (and other optimization).
Date Thu, 26 Sep 2013 06:31:37 GMT
Hello Alok,

about the question 3.a, i guess the framework will indeed try to allocate the local workers.
Each worker is actually a map only task. Due to the behaviour of the Hadoop framework, it
will aim for data locality. Therefore, the framework will try to run the map tasks (and thus
the workers) in nodes that have local blocks.

Best regards,
Marco Aurelio Lotz

Sent from my iPhone

On 17 Sep 2013, at 18:19, "kumbhare@usc.edu<mailto:kumbhare@usc.edu>" <kumbhare@usc.edu<mailto:kumbhare@usc.edu>>
wrote:

Hi,
We have a moderately sized data set (~20M vertices, ~45M edges).

We are running several basic algorithms e.g connected components, single source shortest paths,
page rank etc. The setup includes 12 nodes with 8 core, 16GB ram each. We allow max three
mappers per node (-Xmx5G) and run upto 24 giraph workers for the experiments. We are using
the trunk version, pulled on 9/2 from the github on hadoop 1.2.1. We use HDFS data store (the
file is ~980 MB, with 64MB block size, we get around 15 HDFS blocks)

Input data is in an adjacency list, json format. We use the built in org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
as the input format.

Given this setup, we have a few questions and appreciate any help to optimize the execution:


  1.
We observed that the dataset contains most of the vertices (>90%) with out degree <
20, and some have between 20-1000. However there a few vertices (<0.5%) with a very high
out degree (>100,000).
     *
Due to this, most of the workers load data fairly quickly (~20-30secs), however a couple of
workers take a much longer time (~800secs) to complete just the data input step. Is there
a way to handle such vertices? Or do you suggest using any other input format?
  2.  Another question we have is if, in general, there is a guide for choosing various optimization
parameters?
     *
e.g. number of input/compute/output threads
  3.
Data Locality and in memory messages:
     *
Is there any data locality attempt while running worker? Basically, out of 12 nodes, if the
HDFS blocks for a file are stored only on say 8 nodes and I run 8 workers, is it guaranteed
that the workers will run on those 8 nodes?
     *
Also, if the vertices are located on the same worker, do we have in memory message transfer
between those vertices.
  4.
Partitioning: We wish to study the effect of different partitioning schemes on the runtime.
     *
Is there a Partitioner we can use that will try to collocate neighboring vertices on the same
worker while balancing different partitions? (Basically a METIS Partitioner)
     *
If we do pre-processing of the data file and store neighboring vertices close to each other
in the file, implying different HDFS blocks will approximately contain neighboring vertices,
and use the default giaph partitioner, will that help?


I know this is a long mail, and we truly appreciate your help.

Thanks,
Alok

Sent from Windows Mail


Mime
View raw message