giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wei Zhang <w...@us.ibm.com>
Subject Re: workload used to measure Giraph performance number
Date Thu, 10 Oct 2013 18:12:31 GMT

Thanks a lot Avery, I will take look into PageRankBenchmark  and
PseudoRandomIntNullVertexInputFormat class.

Wei





From:	Avery Ching <aching@apache.org>
To:	user@giraph.apache.org,
Date:	10/10/2013 01:59 PM
Subject:	Re: workload used to measure Giraph performance number



We don't have a data generator that produces HDFS files, but
PageRankBenchmark does have a data generator (see
PseudoRandomIntNullVertexInputFormat)

Hope that helps,

Avery

On 10/9/13 2:49 PM, Wei Zhang wrote:


      Hi Avery,

      Thanks a lot for the pointer! I haven't quite got to the point where
      I can tune the Giraph performance yet (hopefully I can get there
      soon)

      I am still at the early stage of finding/generating the right
      workload to measure the performance. I am wondering is there some
      pointer (in Giraph or elsewhere) that I can use to generate a random
      or representative workload (for pagerank, for now), which complies
      with JsonLongDoubleFloatDoubleVertexInputFormat t? Any pointer is
      greatly appreciated. -- Sorry for being a slow starter.

      I have written a small program myself to generate a random graph that
      has an average out-degree of 6 connected graph that complies with the
      JSON format. For a 2000-vertice graph, it takes about 12 seconds to
      run the pagerank example provided by Giraph (30 supersteps)
      I am only using 1 core of my machine (as I run in the Hadoop's
      "local" mode so that I could jdb into the Giraph job to observe how
      code runs if needed). Each core (out of 8 in total) of my machine is
      Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz, there is 16GB memory in
      total. Assuming the linear speedup when turning on all the 8 cores,
      it is about 1.5 seconds to run the work -- not sure if it is a
      reasonable number or not .

      Thanks!

      Wei
      Inactive
          hide details for Avery Ching ---10/09/2013 05:11:39
      PM---Hi
          Wei, For best performance, please be sure to tune
      the GC sAvery Ching ---10/09/2013 05:11:39 PM---Hi Wei, For best
      performance, please be sure to tune the GC settings, use Java

      From: Avery Ching <aching@apache.org>
      To: user@giraph.apache.org,
      Date: 10/09/2013 05:11 PM
      Subject: Re: workload used to measure Giraph performance number





      Hi Wei,

      For best performance, please be sure to tune the GC settings, use
      Java 7, tune the number of cores used for computation, communication,
      etc. and the combiner.

      We also have some numbers on our recent Facebook blog post.

      https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920


      Avery

      On 10/8/13 7:43 PM, Wei Zhang wrote:

            Hi Sebastian,

            Thanks a lot for the help! Sorry for the late response!

            At this point, I would only need a random graph that complies
            with JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to
            measure the pagerank example (of giraph) performance.
            I am wondering how to convert the data from Koblenz to the
            such a graph ? Is there any pointer of doing this ? (This is
            the same kind of question that I raised to Alok on SNAP)

            Thanks!

            Wei

            P.S.: I forgot to mention in all my previous emails that I just
            get started with distributed graph engine, so please forgive if
            my questions are too naive.

            Inactive hide details for Sebastian Schelter
            ---10/02/2013 12:41:27 PM---Another option is to use the
            Koblenz network collectioSebastian Schelter ---10/02/2013
            12:41:27 PM---Another option is to use the Koblenz network
            collection [1], which offers even more (and larger) dat

            From: Sebastian Schelter <ssc@apache.org>
            To: user@giraph.apache.org,
            Date: 10/02/2013 12:41 PM
            Subject: Re: workload used to measure Giraph performance number



            Another option is to use the Koblenz network collection [1],
            which
            offers even more (and larger) datasets than Snap.

            Best,
            Sebastian

            [1] http://konect.uni-koblenz.de/


            On 02.10.2013 17:41, Alok Kumbhare wrote:
            > There are a number real (medium sized) graphs at
            > http://snap.stanford.edu/data/index.html which we use for
            similar
            > benchmarks. It has a good mix of graph types, sparse/dense,
            ground truth
            > graphs (e.g. social networks that follow power law
            distribution etc.). So
            > far we have observed that the type of graph has a high impact
            on the
            > performance of algorithms that Claudio mentioned.
            >
            >
            > On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <
            claudio.martella@gmail.com
            >> wrote:
            >
            >> Hi Wei,
            >>
            >> it depends on what you mean by workload for a batch
            processing system. I
            >> believe we can split the problem in two: generating a
            realistic graph, and
            >> using "representative" algorithms.
            >>
            >> To generate graphs we have two options in giraph:
            >>
            >> 1) random graph: you specify the number of vertices and the
            number of
            >> edges for each vertex, and the edges will connect two random
            vertices. This
            >> creates a graph with (i) low clustering coefficient, (ii)
            low average path
            >> length, (ii) a uniform degree distribution
            >>
            >> 2) watts strogatz: you specify the number of vertices, the
            number of
            >> edges, and a rewire probability beta. giraph will generate a
            ring lattice
            >> (each vertex is connected to k preceeding vertices and k
            following
            >> vertices) and rewire some of the edges randomly. This will
            create a graph
            >> with (i) high clustering coefficient, (ii) low average path
            length, (iii)
            >> poisson-like degree distribution (depends on beta). This
            graph will
            >> resemble a small world graph such as a social network,
            except for the
            >> degree distribution which will not a power law.
            >>
            >> To use representative algorithms you can choose:
            >>
            >> 1) PageRank: it's a ranking algorithm where all the vertices
            are active
            >> and send messages along the edges at each superstep (hence
            you'll have O(V)
            >> active vertices and O(E) messages)
            >>
            >> 2) Shortest Paths: starting from a random vertex you'll
            visit al the
            >> vertices in the graph (some multiple times). This will have
            an aggregate
            >> O(V) active vertices and O(E) messages, but this is only a
            lower bound. In
            >> general you'l have different areas of the graph explored at
            each superstep,
            >> and hence potentially a varying workload across different
            supersteps.
            >>
            >> 3) Connected Components: this will have something opposite
            to (2) as it
            >> will have many active vertices at the beginning, where the
            detection is
            >> refined towards the end.
            >>
            >>
            >> Hope this helps,
            >> Claudio
            >>
            >>
            >> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang <weiz@us.ibm.com>
            wrote:
            >>
            >>> Hi,
            >>>
            >>> I am interested in measuring some performance numbers of
            Giraph on my
            >>> machine.
            >>>
            >>> I am wondering are there some pointers where I can get some
            >>> (configurable) reasonably large workload to work on ?
            >>>
            >>> Thanks!
            >>>
            >>> Wei
            >>>
            >>
            >>
            >>
            >> --
            >>    Claudio Martella
            >>    claudio.martella@gmail.com
            >>
            >



Mime
View raw message