giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wei Zhang <>
Subject Re: workload used to measure Giraph performance number
Date Wed, 09 Oct 2013 02:43:02 GMT

Hi Sebastian,

Thanks a lot for the help! Sorry for the late response!

At this point, I would only need a random graph that complies with
JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the
pagerank example (of giraph) performance.
I am wondering how to convert the data from Koblenz to the  such a graph ?
Is there any pointer of doing this ? (This is the same kind of question
that I raised to Alok on SNAP)



P.S.: I forgot to mention in all my previous emails that I just get started
with distributed graph engine, so please forgive if my questions are too

From:	Sebastian Schelter <>
Date:	10/02/2013 12:41 PM
Subject:	Re: workload used to measure Giraph performance number

Another option is to use the Koblenz network collection [1], which
offers even more (and larger) datasets than Snap.



On 02.10.2013 17:41, Alok Kumbhare wrote:
> There are a number real (medium sized) graphs at
> which we use for similar
> benchmarks. It has a good mix of graph types, sparse/dense, ground truth
> graphs (e.g. social networks that follow power law distribution etc.). So
> far we have observed that the type of graph has a high impact on the
> performance of algorithms that Claudio mentioned.
> On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella
>> wrote:
>> Hi Wei,
>> it depends on what you mean by workload for a batch processing system. I
>> believe we can split the problem in two: generating a realistic graph,
>> using "representative" algorithms.
>> To generate graphs we have two options in giraph:
>> 1) random graph: you specify the number of vertices and the number of
>> edges for each vertex, and the edges will connect two random vertices.
>> creates a graph with (i) low clustering coefficient, (ii) low average
>> length, (ii) a uniform degree distribution
>> 2) watts strogatz: you specify the number of vertices, the number of
>> edges, and a rewire probability beta. giraph will generate a ring
>> (each vertex is connected to k preceeding vertices and k following
>> vertices) and rewire some of the edges randomly. This will create a
>> with (i) high clustering coefficient, (ii) low average path length,
>> poisson-like degree distribution (depends on beta). This graph will
>> resemble a small world graph such as a social network, except for the
>> degree distribution which will not a power law.
>> To use representative algorithms you can choose:
>> 1) PageRank: it's a ranking algorithm where all the vertices are active
>> and send messages along the edges at each superstep (hence you'll have O
>> active vertices and O(E) messages)
>> 2) Shortest Paths: starting from a random vertex you'll visit al the
>> vertices in the graph (some multiple times). This will have an aggregate
>> O(V) active vertices and O(E) messages, but this is only a lower bound.
>> general you'l have different areas of the graph explored at each
>> and hence potentially a varying workload across different supersteps.
>> 3) Connected Components: this will have something opposite to (2) as it
>> will have many active vertices at the beginning, where the detection is
>> refined towards the end.
>> Hope this helps,
>> Claudio
>> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang <> wrote:
>>> Hi,
>>> I am interested in measuring some performance numbers of Giraph on my
>>> machine.
>>> I am wondering are there some pointers where I can get some
>>> (configurable) reasonably large workload to work on ?
>>> Thanks!
>>> Wei
>> --
>>    Claudio Martella

View raw message