Hi Claudio,
Thanks a lot for the help! Sorry for the late response.
My question was more aligned with your first aspect (to generate graphs) of
problem domain.
I think I can start by running the SimplePageRank example. I have been able
to run some very small data set ( couple of vertices and tens of edges)
over Giraph. The data set is created manually by myself. And the vertex
input format complies with JsonLongDoubleFloatDoubleVertexInputFormat
(following the instructions on the giraph website, unfortunately this is
the only vertex input format that I am aware of. If you think I should use
other vertex input format to ease the performance measuring of Giraph or
other Graph engine, please let me know. I do intend to potentially try with
different Graph engine(s) and compare the numbers. )
To keep everything simple at the beginning, I can start with experimenting
with random graph. So my concrete question now is how to generate a random
graph that complies with JsonLongDoubleFloatDoubleVertexInputFormat, so I
can use this random graph as an input to measure some number of Giraph's
pagerank example.
Thank you very much!
Wei
From: Claudio Martella <claudio.martella@gmail.com>
To: "user@giraph.apache.org" <user@giraph.apache.org>,
Date: 10/02/2013 11:32 AM
Subject: Re: workload used to measure Giraph performance number
Hi Wei,
it depends on what you mean by workload for a batch processing system. I
believe we can split the problem in two: generating a realistic graph, and
using "representative" algorithms.
To generate graphs we have two options in giraph:
1) random graph: you specify the number of vertices and the number of edges
for each vertex, and the edges will connect two random vertices. This
creates a graph with (i) low clustering coefficient, (ii) low average path
length, (ii) a uniform degree distribution
2) watts strogatz: you specify the number of vertices, the number of edges,
and a rewire probability beta. giraph will generate a ring lattice (each
vertex is connected to k preceeding vertices and k following vertices) and
rewire some of the edges randomly. This will create a graph with (i) high
clustering coefficient, (ii) low average path length, (iii) poissonlike
degree distribution (depends on beta). This graph will resemble a small
world graph such as a social network, except for the degree distribution
which will not a power law.
To use representative algorithms you can choose:
1) PageRank: it's a ranking algorithm where all the vertices are active and
send messages along the edges at each superstep (hence you'll have O(V)
active vertices and O(E) messages)
2) Shortest Paths: starting from a random vertex you'll visit al the
vertices in the graph (some multiple times). This will have an aggregate O
(V) active vertices and O(E) messages, but this is only a lower bound. In
general you'l have different areas of the graph explored at each superstep,
and hence potentially a varying workload across different supersteps.
3) Connected Components: this will have something opposite to (2) as it
will have many active vertices at the beginning, where the detection is
refined towards the end.
Hope this helps,
Claudio
On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang <weiz@us.ibm.com> wrote:
Hi,
I am interested in measuring some performance numbers of Giraph on my
machine.
I am wondering are there some pointers where I can get some
(configurable) reasonably large workload to work on ?
Thanks!
Wei

Claudio Martella
claudio.martella@gmail.com
