giraph-user mailing list archives

From Puneet Jain <puneetdabu...@gmail.com>
Subject Millions of node and thousands of edges
Date Thu, 18 Jul 2013 00:25:28 GMT
Hello:

I have a graph with over a million nodes, and each node may have
thousands of edges. My graph is stored in HBase as:

<source, colon_sep_list_of_connected_nodes>

I have thousands of such rows in my HBase table. I am having trouble
running standard algorithms such as PageRank and ConnectedComponents because
of mapper timeouts. I can avoid the timeouts if I reduce the number of
outgoing edges to a few hundred (by doing a partial analysis). One
solution could be to increase the Hadoop mapper timeouts or the
HBase/ZK scanner timeouts, but I would like to know whether Giraph is
intelligent enough to figure out the following:

1. In the vertex input format of Giraph, we create the vertices and edges.
What if I split my HBase rows into multiple rows, such that no row has
more than X neighbours?

So:
<source, colon_sep_list_of_connected_nodes_part1>
<source, colon_sep_list_of_connected_nodes_part2>
<source, colon_sep_list_of_connected_nodes_part3>
............................
<source, colon_sep_list_of_connected_nodes_partn>

This will create multiple mappers for each row, but I am not sure whether
Giraph will determine that multiple nodes with the same id, each with a
smaller number of edges, are actually the same vertex with millions of edges.
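
To make the layout concrete, here is a minimal sketch in plain Java of the
split I have in mind and of the merge I am hoping for. It does not use any
Giraph or HBase API; the class SplitRows and its methods are hypothetical
names made up for illustration, and whether Giraph performs such a merge
itself is exactly what I am asking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Giraph or HBase.
final class SplitRows {

    /** Splits "b:c:d:..." into chunks holding at most maxNeighbours ids each. */
    static List<String> split(String colonSepNeighbours, int maxNeighbours) {
        String[] ids = colonSepNeighbours.split(":");
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < ids.length; start += maxNeighbours) {
            int end = Math.min(start + maxNeighbours, ids.length);
            parts.add(String.join(":", Arrays.copyOfRange(ids, start, end)));
        }
        return parts;
    }

    /**
     * Merges partial rows of the form {source, colonSepNeighbours} back into
     * one neighbour list per source id, in the order the rows are seen.
     */
    static Map<String, List<String>> merge(List<String[]> rows) {
        Map<String, List<String>> adjacency = new LinkedHashMap<>();
        for (String[] row : rows) {
            List<String> neighbours =
                adjacency.computeIfAbsent(row[0], k -> new ArrayList<>());
            neighbours.addAll(Arrays.asList(row[1].split(":")));
        }
        return adjacency;
    }
}
```

Here split("b:c:d:e:f", 2) yields the parts "b:c", "d:e" and "f", and merging
those three rows for source "a" gives back the single list b, c, d, e, f.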

I am also wondering how I can create bidirectional edges in Giraph. Do
I have to modify my input table to contain two rows, one for a-->b and
another for b-->a? Is it not possible to do this while keeping only one
record in the table?
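
As an illustration of what I mean by keeping only one record, the sketch
below (plain Java again, no Giraph API; the class Bidirectional and the
method addUndirected are hypothetical names) expands a single stored row
into edges in both directions at load time.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Giraph or HBase.
final class Bidirectional {

    /**
     * Reads one row <source, colonSepNeighbours> and records both
     * source-->neighbour and neighbour-->source in the adjacency map.
     */
    static void addUndirected(Map<String, Set<String>> graph,
                              String source, String colonSepNeighbours) {
        for (String neighbour : colonSepNeighbours.split(":")) {
            graph.computeIfAbsent(source, k -> new TreeSet<>()).add(neighbour);
            graph.computeIfAbsent(neighbour, k -> new TreeSet<>()).add(source);
        }
    }

    /** Convenience factory for an empty, sorted adjacency map. */
    static Map<String, Set<String>> newGraph() {
        return new TreeMap<>();
    }
}
```

With a single row <a, b:c>, the map ends up with a-->{b, c}, b-->{a} and
c-->{a}, so the table itself never stores the reverse rows.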

Thanks
Puneet

-- 
--Puneet