incubator-giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jake Mannix (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-26) Improve PseudoRandomVertexInputFormat to create a more realistic synthetic graph (e.g. power-law distributed vertex-cardinality).
Date Wed, 07 Sep 2011 15:48:10 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13099043#comment-13099043
] 

Jake Mannix commented on GIRAPH-26:
-----------------------------------

Dmitriy, I was thinking along the lines of a less-rigorous version of the kroeneker product
technique, as described in that paper.  Essentially you do the following, to find the adjacency
matrix entry at row 1523, column 4918, take four "seed matrices" (most likely pseudorandom
variations on a single fixed seed matrix), each of size 10x10 (could be any base, but it's
easiest to consider decimal notation, and 10 is big enough to actually have a sensible small
graph you're going to kroenker iterate on): 

A_{1523, 4918} = S_{1,4} * T_{5,4} * U_{2,1} * V_{3,8}

This requires no communication between the vertexes, and is totally fixed (ie deterministic)
once you choose your seed matrices.

The trick is to figure out (fast!) which entries are nonzero, for any given vertex:

A_{1523, *} = S_{1,*} * T_{5,*} * U_{2,*} * V_{3,*}

So at this point, you know you're dealing with row 1 of S, row 5 of T, row 2 of U, and row
3 of V.  Well, if each of these matrices is represented sparsely, and you can iterate quickly
over *only* their nonzero entries (and there should be only a few in each row, as we're talking
about a 10-vertex seed graph which is not too connected), and thus iterate over only the nonzero
entries in the edge list for vertex 1523 very quickly as well: if row 1 of S has nonzero entries
at points 1,4,9, row 5 of T has nonzeroes at 5,6, row 2 of U has nonzeroes only at 0, and
row 3 of V has nonzeroes at 0,3,4, then we'll have a total of 3*2*1*3 = 18 nonzero edges for
vertex 1523, and we can iterate over them very quickly.

I think this should make a pretty nice, relatively scale-free graph which has density parametrizable
by initial 10x10 seed graph, with randomness introduced by how you want to create those S,T,U...
variations on that seed.  And it should be embarrassingly parallel.

Anything I'm saying make no sense at all?



> Improve PseudoRandomVertexInputFormat to create a more realistic synthetic graph (e.g.
power-law distributed vertex-cardinality).
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-26
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-26
>             Project: Giraph
>          Issue Type: Test
>          Components: benchmark
>            Reporter: Jake Mannix
>            Priority: Minor
>
> The PageRankBenchmark class, to be a proper benchmark, should run over graphs which look
more like data seen in the wild, and web link graphs, social network graphs, and text corpora
(represented as a bipartite graph) all have power-law distributions, so benchmarking a synthetic
graph which looks more like this would be a nice test which would stress cases of uneven split-distribution
and bottlenecks of subclusters of the graph of heavily connected vertices.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message