giraph-user mailing list archives

From: Claudio Martella
Subject: Re: Giraph : newbie questions
Date: Sat, 21 Jul 2012 13:05:43 GMT

There are already a couple of partitioners in the codebase, check those out.
Also, keep in mind that by using fewer workers you diminish network
communication but you also decrease parallelism.
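
To make the partitioning question concrete, here is a minimal sketch in plain Java (not Giraph's API) of the hash-style assignment David describes further down the thread, hash(vertexID) mod number of compute nodes. The class and method names are hypothetical, and treating one partition as one compute node is a simplifying assumption. It only illustrates that the assignment looks at the vertex id and nothing else, so a single high-degree vertex in a power-law graph can leave one partition holding most of the edges.

```java
import java.util.Arrays;

// Hypothetical illustration only: this is not a Giraph class.
public class HashPartitionSketch {

    /** Maps a vertex id to a partition index in [0, numPartitions). */
    public static int partitionFor(long vertexId, int numPartitions) {
        // Math.floorMod keeps the result non-negative even if the hash is negative.
        return Math.floorMod(Long.hashCode(vertexId), numPartitions);
    }

    /**
     * Sums the edge counts that land on each partition.
     * degrees[v] is the (assumed) out-degree of the vertex with id v.
     */
    public static long[] edgesPerPartition(int[] degrees, int numPartitions) {
        long[] load = new long[numPartitions];
        for (int v = 0; v < degrees.length; v++) {
            load[partitionFor(v, numPartitions)] += degrees[v];
        }
        return load;
    }

    public static void main(String[] args) {
        // Made-up sample: one hub vertex and seven low-degree vertices.
        int[] degrees = {1_000_000, 3, 2, 5, 1, 4, 2, 3};
        System.out.println(Arrays.toString(edgesPerPartition(degrees, 4)));
    }
}
```

With the made-up degrees above, partition 0 ends up with 1,000,001 edges and the other three with fewer than ten each, which is the kind of skew a custom partitioner would try to even out.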

On Fri, Jul 20, 2012 at 8:52 PM, Jonathan Bishop wrote:
> Avery,
> Is there an example of overriding the partitioner in the giraph 0.1
> distribution?
> Thanks,
> Jon
> On Tue, Jul 17, 2012 at 11:00 AM, Avery Ching wrote:
>> Answers inline.
>> On 7/17/12 1:22 AM, Nicolas DUGUE wrote:
>>> Thanks for your answer, David!
>>> Okay, but is there a way to force Giraph to partition the graph in our
>>> own way, and if so, how? It may be useful for minimizing communication
>>> between Giraph nodes.
>> The partitioning method is very customizable.  See GraphPartitionerFactory
>> as the interface you need to implement. HashPartitionerFactory is what we
>> use as the default, but you can implement your own.
>>> You mentioned starting the job with a minimum of vertices and adding
>>> new vertices later. That seems really interesting. How do I do that, and
>>> how does it work?
>> The graph is mutable as the application is running.  See MutableVertex for
>> all the local and remote mutations you can make.
>>> For example, say I run my Giraph job with half of the vertices and, during
>>> my first superstep, I add (I don't know how) some vertices to my file. Will
>>> these vertices be taken into account in my first superstep or only in the
>>> next superstep?
>>> And once the vertices are loaded, is it possible to remove them from
>>> memory? In other words, I can add new vertices, but can I remove vertices too?
>>> So, is it possible to change the topology of my graph dynamically?
>> Yes, see above.
>>> Moreover, I'm still wondering which is best: launching one VM with
>>> Giraph on each server with 20 GB of RAM, or launching two of them with
>>> 10 GB of RAM each?
>> Well, in that case, I'm guessing one server with 20 GB, since there would
>> be no communication (which is most of the effort).
>>> And finally, when I launch a Giraph job, ZooKeeper is loaded in one
>>> virtual machine alone. Is there a way to run some Giraph jobs in this
>>> virtual machine too? Or to specify explicitly which VM runs the
>>> ZooKeeper job?
>> ZooKeeper runs in the same slot as the master process. I'm not sure you'd
>> want to do more there, as it's best to balance the memory usage across the
>> workers.
>>> Best regards,
>>> Nicolas
>>> On 16/07/2012 21:51, David Garcia wrote:
>>>> Giraph partitions the vertices using a hashing function that's basically
>>>> the equivalent of (hash(vertexID) mod #ofComputeNodes).
>>>> You can mitigate memory issues by starting the job with a minimum of
>>>> vertices in your file and then adding more dynamically as your job
>>>> progresses (assuming that your job doesn't require all of the vertices).
>>>> -David
>>>> On 7/16/12 4:36 AM, "Nicolas DUGUE" wrote:
>>>>> Hi everybody,
>>>>>      I'm new to Giraph, so I have a few questions about how it works and
>>>>> how to configure it to work as well as possible.
>>>>>      We have set up a cluster of 6 servers with 24 CPUs and 24 GB of RAM,
>>>>> and we want to use it to experiment with Giraph.
>>>>>      So far we've made a few runs and have had some memory problems;
>>>>> it seems that we don't give enough memory to the JVM (GC overhead,
>>>>> OutOfMemory, ...).
>>>>>      Our experiments were PageRank benchmarks. We only succeeded in
>>>>> running it on a graph with 100 million edges, by running two virtual
>>>>> machines with 8 GB of RAM on each of our servers.
>>>>>      Here are our questions:
>>>>>      - Which is best: launching one VM with Giraph on each server with
>>>>> 20 GB of RAM, or launching two of them with 10 GB of RAM each?
>>>>>      - Is there a way to minimize the memory used by Hadoop, to give
>>>>> more memory to the Giraph jobs?
>>>>>      - How is the graph distributed across the cluster? Our graph may
>>>>> be a power-law graph, with a few nodes that have a very large number of
>>>>> edges and many nodes that have only a few. How will Giraph distribute
>>>>> this kind of graph? Does it take into account the number of edges of
>>>>> each vertex?
>>>>> Thanks in advance,
>>>>> Nicolas Dugué
>>>>> PhD student at the University of Orléans

   Claudio Martella
