giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wei Zhang <w...@us.ibm.com>
Subject Re: workload used to measure Giraph performance number
Date Wed, 09 Oct 2013 21:49:57 GMT

Hi Avery,

Thanks a lot for the pointer! I haven't quite got to the point where I can
tune the Giraph performance yet (hopefully I can get there soon)

I am still at the early stage of finding/generating the right workload to
measure the performance. I am wondering is there some pointer (in Giraph or
elsewhere) that I can use to generate a random or representative workload
(for pagerank, for now), which complies with
JsonLongDoubleFloatDoubleVertexInputFormat t? Any pointer is greatly
appreciated. -- Sorry for being a slow starter.

I have written a small program myself to generate a random graph that has
an average out-degree of 6 connected graph that complies with the JSON
format. For a 2000-vertice graph, it takes about 12 seconds to run the
pagerank example provided by Giraph (30 supersteps)
I am only using 1 core of my machine (as I run in the Hadoop's "local" mode
so that I could jdb into the Giraph job to observe how code runs if
needed). Each core (out of 8 in total) of my machine is Intel(R) Core(TM)
i7-3740QM CPU @ 2.70GHz, there is 16GB memory in total. Assuming the linear
speedup when turning on all the 8 cores, it is about 1.5 seconds to run the
work -- not sure if it is a reasonable number or not .

Thanks!

Wei


From:	Avery Ching <aching@apache.org>
To:	user@giraph.apache.org,
Date:	10/09/2013 05:11 PM
Subject:	Re: workload used to measure Giraph performance number



Hi Wei,

For best performance, please be sure to tune the GC settings, use Java 7,
tune the number of cores used for computation, communication, etc. and the
combiner.

We also have some numbers on our recent Facebook blog post.

https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920


Avery

On 10/8/13 7:43 PM, Wei Zhang wrote:


      Hi Sebastian,

      Thanks a lot for the help! Sorry for the late response!

      At this point, I would only need a random graph that complies with
      JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the
      pagerank example (of giraph) performance.
      I am wondering how to convert the data from Koblenz to the  such a
      graph ? Is there any pointer of doing this ? (This is the same kind
      of question that I raised to Alok on SNAP)

      Thanks!

      Wei

      P.S.: I forgot to mention in all my previous emails that I just get
      started with distributed graph engine, so please forgive if my
      questions are too naive.

      Inactive
          hide details for Sebastian Schelter ---10/02/2013
      12:41:27
          PM---Another option is to use the Koblenz network
      collectioSebastian Schelter ---10/02/2013 12:41:27 PM---Another
      option is to use the Koblenz network collection [1], which offers
      even more (and larger) dat

      From: Sebastian Schelter <ssc@apache.org>
      To: user@giraph.apache.org,
      Date: 10/02/2013 12:41 PM
      Subject: Re: workload used to measure Giraph performance number





      Another option is to use the Koblenz network collection [1], which
      offers even more (and larger) datasets than Snap.

      Best,
      Sebastian

      [1] http://konect.uni-koblenz.de/


      On 02.10.2013 17:41, Alok Kumbhare wrote:
      > There are a number real (medium sized) graphs at
      > http://snap.stanford.edu/data/index.html which we use for similar
      > benchmarks. It has a good mix of graph types, sparse/dense, ground
      truth
      > graphs (e.g. social networks that follow power law distribution
      etc.). So
      > far we have observed that the type of graph has a high impact on
      the
      > performance of algorithms that Claudio mentioned.
      >
      >
      > On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <
      claudio.martella@gmail.com
      >> wrote:
      >
      >> Hi Wei,
      >>
      >> it depends on what you mean by workload for a batch processing
      system. I
      >> believe we can split the problem in two: generating a realistic
      graph, and
      >> using "representative" algorithms.
      >>
      >> To generate graphs we have two options in giraph:
      >>
      >> 1) random graph: you specify the number of vertices and the number
      of
      >> edges for each vertex, and the edges will connect two random
      vertices. This
      >> creates a graph with (i) low clustering coefficient, (ii) low
      average path
      >> length, (ii) a uniform degree distribution
      >>
      >> 2) watts strogatz: you specify the number of vertices, the number
      of
      >> edges, and a rewire probability beta. giraph will generate a ring
      lattice
      >> (each vertex is connected to k preceeding vertices and k following
      >> vertices) and rewire some of the edges randomly. This will create
      a graph
      >> with (i) high clustering coefficient, (ii) low average path
      length, (iii)
      >> poisson-like degree distribution (depends on beta). This graph
      will
      >> resemble a small world graph such as a social network, except for
      the
      >> degree distribution which will not a power law.
      >>
      >> To use representative algorithms you can choose:
      >>
      >> 1) PageRank: it's a ranking algorithm where all the vertices are
      active
      >> and send messages along the edges at each superstep (hence you'll
      have O(V)
      >> active vertices and O(E) messages)
      >>
      >> 2) Shortest Paths: starting from a random vertex you'll visit al
      the
      >> vertices in the graph (some multiple times). This will have an
      aggregate
      >> O(V) active vertices and O(E) messages, but this is only a lower
      bound. In
      >> general you'l have different areas of the graph explored at each
      superstep,
      >> and hence potentially a varying workload across different
      supersteps.
      >>
      >> 3) Connected Components: this will have something opposite to (2)
      as it
      >> will have many active vertices at the beginning, where the
      detection is
      >> refined towards the end.
      >>
      >>
      >> Hope this helps,
      >> Claudio
      >>
      >>
      >> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang <weiz@us.ibm.com> wrote:
      >>
      >>> Hi,
      >>>
      >>> I am interested in measuring some performance numbers of Giraph
      on my
      >>> machine.
      >>>
      >>> I am wondering are there some pointers where I can get some
      >>> (configurable) reasonably large workload to work on ?
      >>>
      >>> Thanks!
      >>>
      >>> Wei
      >>>
      >>
      >>
      >>
      >> --
      >>    Claudio Martella
      >>    claudio.martella@gmail.com
      >>
      >



Mime
View raw message