giraph-user mailing list archives

From Pavan Kumar A <>
Subject RE: Changing index of a graph
Date Wed, 16 Apr 2014 08:15:50 GMT
It depends entirely on the input distribution. One very simple thing that can be done is:

Define a VertexResolver that, on every vertex creation, sets its id = domain of the URL and its value = the "set" of URLs in that domain; it keeps appending as more vertices with the same id (i.e., domain) are read from the input. Now you can ignore edges altogether; all you are left with is these huge vertices, identified by domain, whose value is the set of URLs.

Here you can use the aggregator approach: send (domain, size of set) to the master, where the aggregators are combined into something like [(domain1, offset1), (domain2, offset2), ...]. All the (huge) vertices read this aggregator and figure out their offset; then at output time, each URL in the set is written with number = offset + its position within the set.

So you now have a map, though it is highly unstable, because adding one more URL to a domain later changes the ordering completely; that is when you can use id = domain + insert date, etc. [[This will stop working at some point, because the aggregator needs to carry huge messages, and the computation of offsets via aggregators then needs to be done over multiple supersteps, etc.]]

Anyway, now that you have the map of url -> number, all you have to do is a join, and that's simple: read your original table plus this map table in a single Giraph job, and you can use two supersteps to rename all the vertices properly.
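A minimal plain-Java sketch of the master-side step (class and method names here are hypothetical, and the Giraph aggregator plumbing is omitted): given the per-domain URL counts collected from the vertices, compute each domain's starting offset, so that every URL ends up with number = offset + position within its domain's set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: turn per-domain URL counts (as the master would
 *  receive them via aggregators) into starting offsets per domain. */
public class DomainOffsets {

    static Map<String, Long> computeOffsets(Map<String, Long> countsByDomain) {
        Map<String, Long> offsets = new LinkedHashMap<>();
        long running = 0;
        for (Map.Entry<String, Long> e : countsByDomain.entrySet()) {
            offsets.put(e.getKey(), running); // first id assigned to this domain
            running += e.getValue();          // next domain starts after this one
        }
        return offsets;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("example.com", 3L);
        counts.put("apache.org", 5L);
        counts.put("neo4j.com", 2L);
        // dense, hole-free: example.com gets 0..2, apache.org 3..7, neo4j.com 8..9
        System.out.println(computeOffsets(counts)); // {example.com=0, apache.org=3, neo4j.com=8}
    }
}
```

Note this relies on every participant seeing the domains in the same order (LinkedHashMap preserves insertion order here); in the real aggregator version the combined list would have to be ordered deterministically.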

[[Note that you can do another thing here as well, MUCH SIMPLER THAN ABOVE.]]

Superstep 0: in your compute class, keep a thread-local counter that increases for each vertex the thread computes, and assign each vertex the value [(workerId, threadId), number]. Then aggregate {(workerId, threadId), count}.

Superstep 1: the master now sees [{(workerId, threadId), count in group}], so it computes another aggregator of the form [{(workerId, threadId), cumulative sum up to now}] and sends this aggregator to the workers.

Each worker then reads its cumulative sum from the aggregator and adds it to each vertex's current value. When you output the graph this time as edge output, source id and target id are set to the vertex values, i.e. the assigned numbers.
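The two supersteps can be sketched in plain Java (class and method names are hypothetical, and the thread-local counting and aggregator wiring are omitted): the master turns the per-(workerId, threadId) group sizes into an exclusive cumulative sum, and each vertex's final id is its group's base plus its local number.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the two-superstep numbering scheme: each
 *  (workerId, threadId) group numbers its vertices locally; the master
 *  computes where each group's id range starts; final id = base + local. */
public class TwoStepNumbering {

    /** Master side: exclusive cumulative sum = first global id of each group. */
    static Map<String, Long> baseIds(Map<String, Long> groupCounts) {
        Map<String, Long> base = new LinkedHashMap<>();
        long running = 0;
        for (Map.Entry<String, Long> e : groupCounts.entrySet()) {
            base.put(e.getKey(), running); // this group's first global id
            running += e.getValue();       // next group starts after it
        }
        return base;
    }

    public static void main(String[] args) {
        Map<String, Long> groupCounts = new LinkedHashMap<>();
        groupCounts.put("w0-t0", 4L); // worker 0, thread 0 counted 4 vertices
        groupCounts.put("w0-t1", 2L);
        groupCounts.put("w1-t0", 3L);

        Map<String, Long> base = baseIds(groupCounts);
        // worker side: the vertex with local number 1 in group w1-t0
        // gets global id 6 + 1 = 7
        System.out.println(base.get("w1-t0") + 1); // prints 7
    }
}
```

The ids come out dense and unique without any vertex-to-vertex messages, which is why this is so much cheaper than the per-domain variant.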
Date: Tue, 15 Apr 2014 23:40:39 +0200
Subject: Re: Changing index of a graph

I have a pipeline that creates a graph then does some transformations on it (with Giraph).
In the end I want to dump it into Neo4j to allow for cypher queries.  
I was told that I could make the batch import for Neo4j a lot faster if I used Long identifiers
without holes, thereby matching their internal ID space. If I understand it right, they
use the IDs as offsets to build an on-disk index, which is why the space should have
no holes.

I didn't expect it to be so costly to change the index, but I guess this way I could at least
spread the load to the cluster, since batch import happens on a single machine.

Thanks for the input, I will see what makes the most sense with the limited time I have.

On Tue, Apr 15, 2014 at 5:31 PM, Lukas Nalezenec <> wrote:


      I did the same thing in two M/R jobs during preprocessing - it was
      pretty powerful for web graphs but a little bit slow.


      Solution for Giraph is:

      1. Implement your own Partition that iterates vertices in order,
      and use an appropriate partitioner.

      2. During the first iteration, rename the vertices in each
      partition without holes. Holes will remain only between partitions.

          At the end, get the min and max vertex index for each partition,
      send them to the master in an aggregator, and compute the mapping
      required to delete the holes.

      3. During the second iteration, iterate over all vertices and delete
      the holes by shifting the vertex indexes.


      4. .... rename edges (two more iterations)...
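Steps 2 and 3 above can be sketched in plain Java (class and method names are hypothetical; the Giraph iterations and aggregator wiring are omitted): given each partition's dense [min, max] index range, compute the shift each partition must subtract from its ids so the holes between partitions disappear.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: each partition has renamed its vertices into a
 *  dense local range [min, max]; the master collects these ranges and
 *  computes how far each partition must shift its ids to close the gaps. */
public class HoleShift {

    /** ranges.get(i) = {min, max} for partition i, sorted by min. */
    static long[] computeShifts(List<long[]> ranges) {
        long[] shifts = new long[ranges.size()];
        long next = 0; // next free id after compaction
        for (int i = 0; i < ranges.size(); i++) {
            long min = ranges.get(i)[0], max = ranges.get(i)[1];
            shifts[i] = min - next;   // subtract this from every id in partition i
            next += max - min + 1;    // compacted partition occupies the next block
        }
        return shifts;
    }

    public static void main(String[] args) {
        List<long[]> ranges = new ArrayList<>();
        ranges.add(new long[]{0, 99});      // partition 0: already dense
        ranges.add(new long[]{1000, 1049}); // partition 1: hole before it
        long[] shifts = computeShifts(ranges);
        // partition 1 shifts its ids 1000..1049 down by 900, to 100..149
        System.out.println(shifts[0] + " " + shifts[1]); // prints 0 900
    }
}
```

Step 4 then applies the same shifts to the source and target ids stored on the edges.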


      Btw: Why do you need such indexes? For HLL?




      On 15.4.2014 15:33, Martin Neumann wrote:


        I have a huge edge list (several billion edges) where the node
          IDs are URLs.
        The algorithm I want to run needs the IDs to be longs, and
          there should be no holes in the ID space (so I can't simply
          hash the URLs).

        Is anyone aware of a simple solution that does not require
          an impractically huge hash map?

        My current idea is to load the graph into another Giraph
          job and then assign a number to each node. This way the
          mapping of number to URL would be stored in the node.
        The problem is that I have to assign the numbers
          sequentially to ensure there are no holes and the numbers are
          unique. No idea if this is even possible in Giraph.

        Any input is welcome

        cheers Martin

