giraph-user mailing list archives

From Martin Neumann <mneum...@spotify.com>
Subject Re: Changing index of a graph
Date Tue, 15 Apr 2014 21:40:39 GMT
I have a pipeline that creates a graph then does some transformations on it
(with Giraph).
In the end I want to dump it into Neo4j to allow for cypher queries.

I was told that I could make the batch import for Neo4j a lot faster if I
used Long identifiers without holes, thereby matching their internal ID
space. If I understand it right, they build an on-disk index that uses the
IDs as offsets, which is why there should be no holes.

I didn't expect it to be so costly to change the index, but I guess this
way I could at least spread the load to the cluster, since batch import
happens on a single machine.

Thanks for the input, I will see what makes the most sense with the limited
time I have.


On Tue, Apr 15, 2014 at 5:31 PM, Lukas Nalezenec <
lukas.nalezenec@firma.seznam.cz> wrote:

>  Hi,
> I did the same thing in two M/R jobs during preprocessing - it was pretty
> powerful for web graphs but a little bit slow.
>
> Solution for Giraph is:
> 1. Implement your own partition which will iterate vertices in order. Use
> an appropriate partitioner.
> 2. During the first iteration you need to rename vertices in each partition
> without holes. Holes will be only between partitions.
>     At the end, get the min and max vertex index for each partition, send
> them to the master in an aggregator and compute the mapping required to
> delete holes.
> 3. During the second iteration, iterate over all vertices and delete holes
> by shifting vertex indexes.
>
> 4. .... rename edges (two more iterations)...
>
> Btw: Why do you need such indexes? For HLL?
>
> Lukas
>
>
> On 15.4.2014 15:33, Martin Neumann wrote:
>
> Hej,
>
>  I have a huge edge list (several billion edges) where node IDs are URLs.
> The algorithm I want to run needs the IDs to be longs, and there should be
> no holes in the ID space (so I can't simply hash the URLs).
>
>  Is anyone aware of a simple solution that does not require an
> impractically huge hash map?
>
>  My idea currently is to load the graph into another Giraph job and then
> assign a number to each node. This way the mapping of number to URL
> would be stored in the node.
> The problem is that I have to assign the numbers sequentially to ensure
> there are no holes and the numbers are unique. No idea if this is even
> possible in Giraph.
>
>  Any input is welcome
>
>  cheers Martin
>
>
>
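
The renumbering steps quoted above (number vertices densely inside each
partition, report each partition's min and max to the master, then shift
every partition down to close the gaps) can be sketched in plain Python.
This is only an illustration of the arithmetic, not Giraph code: the
partitions are ordinary lists, and STRIDE, the function names and the
example URLs are all made up for the sketch. It also assumes no partition
holds more than STRIDE vertices. In a real job the ranges would travel
through an aggregator and the edge rename would need message passing.

```python
from typing import Dict, List, Tuple

STRIDE = 1_000  # hypothetical size of the ID range reserved for each partition


def number_within_partitions(
    partitions: List[List[str]],
) -> Tuple[Dict[str, int], List[Tuple[int, int]]]:
    """First pass: provisional IDs that are dense inside each partition."""
    provisional: Dict[str, int] = {}
    ranges: List[Tuple[int, int]] = []  # (min_id, max_id) per partition
    for p, vertices in enumerate(partitions):
        base = p * STRIDE
        # Iterate vertices in a fixed order so the numbering is deterministic.
        for i, url in enumerate(sorted(vertices)):
            provisional[url] = base + i
        ranges.append((base, base + len(vertices) - 1))
    return provisional, ranges


def compute_shifts(ranges: List[Tuple[int, int]]) -> List[int]:
    """Master side: how far each partition must move down to close the holes."""
    shifts: List[int] = []
    next_free = 0  # first ID not yet used by earlier partitions
    for min_id, max_id in ranges:
        shifts.append(min_id - next_free)
        next_free += max_id - min_id + 1
    return shifts


def close_holes(provisional: Dict[str, int], shifts: List[int]) -> Dict[str, int]:
    """Second pass: shift every vertex by the offset of its partition."""
    return {url: pid - shifts[pid // STRIDE] for url, pid in provisional.items()}


def rename_edges(
    edges: List[Tuple[str, str]], final_ids: Dict[str, int]
) -> List[Tuple[int, int]]:
    """Last step: rewrite edge endpoints with the final IDs."""
    return [(final_ids[src], final_ids[dst]) for src, dst in edges]


# Tiny made-up example: three partitions, six URLs, two edges.
partitions = [
    ["http://a.example/", "http://c.example/"],
    ["http://b.example/"],
    ["http://d.example/", "http://e.example/", "http://f.example/"],
]
edges = [
    ("http://a.example/", "http://b.example/"),
    ("http://f.example/", "http://c.example/"),
]

provisional, ranges = number_within_partitions(partitions)
shifts = compute_shifts(ranges)
final_ids = close_holes(provisional, shifts)
renamed = rename_edges(edges, final_ids)
```

With this input the final IDs come out as 0 through 5 with no gaps. Only one
(min, max) pair per partition has to reach the master, so the mapping from
URL to number never has to fit in a single hash map.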
