Another option is to use the Koblenz network collection [1], which
offers even more (and larger) datasets than Snap.
Best,
Sebastian
[1] http://konect.unikoblenz.de/
On 02.10.2013 17:41, Alok Kumbhare wrote:
> There are a number real (medium sized) graphs at
> http://snap.stanford.edu/data/index.html which we use for similar
> benchmarks. It has a good mix of graph types, sparse/dense, ground truth
> graphs (e.g. social networks that follow power law distribution etc.). So
> far we have observed that the type of graph has a high impact on the
> performance of algorithms that Claudio mentioned.
>
>
> On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <claudio.martella@gmail.com
>> wrote:
>
>> Hi Wei,
>>
>> it depends on what you mean by workload for a batch processing system. I
>> believe we can split the problem in two: generating a realistic graph, and
>> using "representative" algorithms.
>>
>> To generate graphs we have two options in giraph:
>>
>> 1) random graph: you specify the number of vertices and the number of
>> edges for each vertex, and the edges will connect two random vertices. This
>> creates a graph with (i) low clustering coefficient, (ii) low average path
>> length, (ii) a uniform degree distribution
>>
>> 2) watts strogatz: you specify the number of vertices, the number of
>> edges, and a rewire probability beta. giraph will generate a ring lattice
>> (each vertex is connected to k preceeding vertices and k following
>> vertices) and rewire some of the edges randomly. This will create a graph
>> with (i) high clustering coefficient, (ii) low average path length, (iii)
>> poissonlike degree distribution (depends on beta). This graph will
>> resemble a small world graph such as a social network, except for the
>> degree distribution which will not a power law.
>>
>> To use representative algorithms you can choose:
>>
>> 1) PageRank: it's a ranking algorithm where all the vertices are active
>> and send messages along the edges at each superstep (hence you'll have O(V)
>> active vertices and O(E) messages)
>>
>> 2) Shortest Paths: starting from a random vertex you'll visit al the
>> vertices in the graph (some multiple times). This will have an aggregate
>> O(V) active vertices and O(E) messages, but this is only a lower bound. In
>> general you'l have different areas of the graph explored at each superstep,
>> and hence potentially a varying workload across different supersteps.
>>
>> 3) Connected Components: this will have something opposite to (2) as it
>> will have many active vertices at the beginning, where the detection is
>> refined towards the end.
>>
>>
>> Hope this helps,
>> Claudio
>>
>>
>> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang <weiz@us.ibm.com> wrote:
>>
>>> Hi,
>>>
>>> I am interested in measuring some performance numbers of Giraph on my
>>> machine.
>>>
>>> I am wondering are there some pointers where I can get some
>>> (configurable) reasonably large workload to work on ?
>>>
>>> Thanks!
>>>
>>> Wei
>>>
>>
>>
>>
>> 
>> Claudio Martella
>> claudio.martella@gmail.com
>>
>
