giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Avery Ching <ach...@apache.org>
Subject Re: workload used to measure Giraph performance number
Date Thu, 10 Oct 2013 17:57:28 GMT
We don't have a data generator that produces HDFS files, but 
PageRankBenchmark does have a data generator (see 
PseudoRandomIntNullVertexInputFormat)

Hope that helps,

Avery

On 10/9/13 2:49 PM, Wei Zhang wrote:
>
> Hi Avery,
>
> Thanks a lot for the pointer! I haven't quite got to the point where I 
> can tune the Giraph performance yet (hopefully I can get there soon)
>
> I am still at the early stage of finding/generating the right workload 
> to measure the performance. I am wondering is there some pointer (in 
> Giraph or elsewhere) that I can use to generate a random or 
> representative workload (for pagerank, for now), which complies with 
> JsonLongDoubleFloatDoubleVertexInputFormat t? Any pointer is greatly 
> appreciated. -- Sorry for being a slow starter.
>
> I have written a small program myself to generate a random graph that 
> has an average out-degree of 6 connected graph that complies with the 
> JSON format. For a 2000-vertice graph, it takes about 12 seconds to 
> run the pagerank example provided by Giraph (30 supersteps)
> I am only using 1 core of my machine (as I run in the Hadoop's "local" 
> mode so that I could jdb into the Giraph job to observe how code runs 
> if needed). Each core (out of 8 in total) of my machine is Intel(R) 
> Core(TM) i7-3740QM CPU @ 2.70GHz, there is 16GB memory in total. 
> Assuming the linear speedup when turning on all the 8 cores, it is 
> about 1.5 seconds to run the work -- not sure if it is a reasonable 
> number or not .
>
> Thanks!
>
> Wei
> Inactive hide details for Avery Ching ---10/09/2013 05:11:39 PM---Hi 
> Wei, For best performance, please be sure to tune the GC sAvery Ching 
> ---10/09/2013 05:11:39 PM---Hi Wei, For best performance, please be 
> sure to tune the GC settings, use Java
>
> From: Avery Ching <aching@apache.org>
> To: user@giraph.apache.org,
> Date: 10/09/2013 05:11 PM
> Subject: Re: workload used to measure Giraph performance number
>
> ------------------------------------------------------------------------
>
>
>
> Hi Wei,
>
> For best performance, please be sure to tune the GC settings, use Java 
> 7, tune the number of cores used for computation, communication, etc. 
> and the combiner.
>
> We also have some numbers on our recent Facebook blog post.
> _
> __https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920_
>
> Avery
>
> On 10/8/13 7:43 PM, Wei Zhang wrote:
>
>
>     Hi Sebastian,
>
>     Thanks a lot for the help! Sorry for the late response!
>
>     At this point, I would only need a random graph that complies with
>     JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure
>     the pagerank example (of giraph) performance.
>     I am wondering how to convert the data from Koblenz to the  such a
>     graph ? Is there any pointer of doing this ? (This is the same
>     kind of question that I raised to Alok on SNAP)
>
>     Thanks!
>
>     Wei
>
>     P.S.: I forgot to mention in all my previous emails that I just
>     get started with distributed graph engine, so please forgive if my
>     questions are too naive.
>
>     Inactive hide details for Sebastian Schelter ---10/02/2013
>     12:41:27 PM---Another option is to use the Koblenz network
>     collectioSebastian Schelter ---10/02/2013 12:41:27 PM---Another
>     option is to use the Koblenz network collection [1], which offers
>     even more (and larger) dat
>
>     From: Sebastian Schelter _<ssc@apache.org>_ <mailto:ssc@apache.org>
>     To: _user@giraph.apache.org_ <mailto:user@giraph.apache.org>,
>     Date: 10/02/2013 12:41 PM
>     Subject: Re: workload used to measure Giraph performance number
>     ------------------------------------------------------------------------
>
>
>
>     Another option is to use the Koblenz network collection [1], which
>     offers even more (and larger) datasets than Snap.
>
>     Best,
>     Sebastian
>
>     [1] _http://konect.uni-koblenz.de/_
>
>
>     On 02.10.2013 17:41, Alok Kumbhare wrote:
>     > There are a number real (medium sized) graphs at
>     > _http://snap.stanford.edu/data/index.html_ which we use for similar
>     > benchmarks. It has a good mix of graph types, sparse/dense,
>     ground truth
>     > graphs (e.g. social networks that follow power law distribution
>     etc.). So
>     > far we have observed that the type of graph has a high impact on the
>     > performance of algorithms that Claudio mentioned.
>     >
>     >
>     > On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella
>     <_claudio.martella@gmail.com_ <mailto:claudio.martella@gmail.com>
>     >> wrote:
>     >
>     >> Hi Wei,
>     >>
>     >> it depends on what you mean by workload for a batch processing
>     system. I
>     >> believe we can split the problem in two: generating a realistic
>     graph, and
>     >> using "representative" algorithms.
>     >>
>     >> To generate graphs we have two options in giraph:
>     >>
>     >> 1) random graph: you specify the number of vertices and the
>     number of
>     >> edges for each vertex, and the edges will connect two random
>     vertices. This
>     >> creates a graph with (i) low clustering coefficient, (ii) low
>     average path
>     >> length, (ii) a uniform degree distribution
>     >>
>     >> 2) watts strogatz: you specify the number of vertices, the
>     number of
>     >> edges, and a rewire probability beta. giraph will generate a
>     ring lattice
>     >> (each vertex is connected to k preceeding vertices and k following
>     >> vertices) and rewire some of the edges randomly. This will
>     create a graph
>     >> with (i) high clustering coefficient, (ii) low average path
>     length, (iii)
>     >> poisson-like degree distribution (depends on beta). This graph will
>     >> resemble a small world graph such as a social network, except
>     for the
>     >> degree distribution which will not a power law.
>     >>
>     >> To use representative algorithms you can choose:
>     >>
>     >> 1) PageRank: it's a ranking algorithm where all the vertices
>     are active
>     >> and send messages along the edges at each superstep (hence
>     you'll have O(V)
>     >> active vertices and O(E) messages)
>     >>
>     >> 2) Shortest Paths: starting from a random vertex you'll visit
>     al the
>     >> vertices in the graph (some multiple times). This will have an
>     aggregate
>     >> O(V) active vertices and O(E) messages, but this is only a
>     lower bound. In
>     >> general you'l have different areas of the graph explored at
>     each superstep,
>     >> and hence potentially a varying workload across different
>     supersteps.
>     >>
>     >> 3) Connected Components: this will have something opposite to
>     (2) as it
>     >> will have many active vertices at the beginning, where the
>     detection is
>     >> refined towards the end.
>     >>
>     >>
>     >> Hope this helps,
>     >> Claudio
>     >>
>     >>
>     >> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang _<weiz@us.ibm.com>_
>     <mailto:weiz@us.ibm.com> wrote:
>     >>
>     >>> Hi,
>     >>>
>     >>> I am interested in measuring some performance numbers of
>     Giraph on my
>     >>> machine.
>     >>>
>     >>> I am wondering are there some pointers where I can get some
>     >>> (configurable) reasonably large workload to work on ?
>     >>>
>     >>> Thanks!
>     >>>
>     >>> Wei
>     >>>
>     >>
>     >>
>     >>
>     >> --
>     >>    Claudio Martella
>     >> _claudio.martella@gmail.com_ <mailto:claudio.martella@gmail.com>
>     >>
>     >
>
>


Mime
View raw message