giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <kumbh...@usc.edu>
Subject Handling vertices with huge number of outgoing edges (and other optimization).
Date Tue, 17 Sep 2013 15:39:21 GMT
Hi,

We have a moderately sized data set (~20M vertices, ~45M edges). 


We are running several basic algorithms e.g connected components, single source shortest paths,
page rank etc. The setup includes 12 nodes with 8 core, 16GB ram each. We allow max three
mappers per node (-Xmx5G) and run upto 24 giraph workers for the experiments. We are using
the trunk version, pulled on 9/2 from the github on hadoop 1.2.1. We use HDFS data store (the
file is ~980 MB, with 64MB block size, we get around 15 HDFS blocks)


Input data is in an adjacency list, json format. We use the built in org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
as the input format.


Given this setup, we have a few questions and appreciate any help to optimize the execution:


We observed that the dataset contains most of the vertices (>90%) with out degree <
20, and some have between 20-1000. However there a few vertices (<0.5%) with a very high
out degree (>100,000). 


Due to this, most of the workers load data fairly quickly (~20-30secs), however a couple of
workers take a much longer time (~800secs) to complete just the data input step. Is there
a way to handle such vertices? Or do you suggest using any other input format?

Another question we have is if, in general, there is a guide for choosing various optimization
parameters?

e.g. number of input/compute/output threads


Data Locality and in memory messages:


Is there any data locality attempt while running worker? Basically, out of 12 nodes, if the
HDFS blocks for a file are stored only on say 8 nodes and I run 8 workers, is it guaranteed
that the workers will run on those 8 nodes?


Also, if the vertices are located on the same worker, do we have in memory message transfer
between those vertices.


Partitioning: We wish to study the effect of different partitioning schemes on the runtime.



Is there a Partitioner we can use that will try to collocate neighboring vertices on the same
worker while balancing different partitions? (Basically a METIS Partitioner)


If we do pre-processing of the data file and store neighboring vertices close to each other
in the file, implying different HDFS blocks will approximately contain neighboring vertices,
and use the default giaph partitioner, will that help? 




I know this is a long mail, and we truly appreciate your help.


Thanks,

Alok






Sent from Windows Mail
Mime
View raw message