incubator-giraph-user mailing list archives

From Claudio Martella <>
Subject Re: Estimating approximate hadoop cluster size
Date Mon, 20 Feb 2012 17:26:40 GMT
As Avery put it, it's difficult to estimate the memory footprint of
your graph. On one hand, you will probably have a smaller memory
footprint in RAM, thanks to the compact generic types used for your
Vertex, than the amount of data required to persist it as Text on
HDFS: it takes 4 bytes to store 10000000 as an int, but considerably
more as a Unicode string in a file. On the other hand, keeping a
vertex in memory also means keeping the related data structures in
memory, and their overhead is another story to estimate.
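To make the int-versus-text comparison concrete, here is a small
sketch; the class name is mine, and the text figure assumes the
decimal digits are written in UTF-8 (one byte per digit):

```java
import java.nio.charset.StandardCharsets;

public class IntVsText {
    public static void main(String[] args) {
        // A fixed-width int always occupies 4 bytes when serialized.
        int asInt = Integer.BYTES;

        // The same value written as text costs one byte per digit in UTF-8,
        // so 10000000 takes 8 bytes before any delimiters.
        int asText = String.valueOf(10000000)
                .getBytes(StandardCharsets.UTF_8).length;

        System.out.println("int: " + asInt + " bytes, text: " + asText + " bytes");
        // prints: int: 4 bytes, text: 8 bytes
    }
}
```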
In general, I think it should be made quite clear that Giraph and
Pregel were designed for scenarios where you can keep your graph and
the messages produced in memory. That's what makes them so fast. If
your graph is much larger than your memory, you are better off
sticking with MapReduce, which is designed exactly for that case.
After all, your computation will be dominated by disk I/O, so there is
not much advantage to be gained from Giraph and Pregel, even once the
out-of-core graph and message implementations are ready.
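For the 15-million-vertex case raised below, a purely illustrative
back-of-envelope sketch follows; the per-vertex and per-edge byte
costs are assumptions of mine, not measurements of Giraph's actual
data structures, so treat the result as a rough lower bound:

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        // Figures from the thread: ~15 million vertices, ~50 neighbours each.
        long vertices = 15_000_000L;
        long edgesPerVertex = 50;

        // Assumed JVM costs: ~64 bytes per vertex object (header plus fields),
        // ~8 bytes per edge entry (an int id plus array/reference overhead).
        long bytesPerVertex = 64;
        long bytesPerEdge = 8;

        long totalBytes = vertices * (bytesPerVertex + edgesPerVertex * bytesPerEdge);
        System.out.printf("~%.1f GB heap, before messages and GC headroom%n",
                totalBytes / (1024.0 * 1024 * 1024));
    }
}
```

Under these assumptions the graph alone comes to roughly 6.5 GB of
heap across the cluster, before accounting for messages, per-worker
overhead, or garbage-collection headroom.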

Hope this helps,

On Mon, Feb 20, 2012 at 10:25 AM, yavuz gokirmak <> wrote:
> Yes, I don't need to load a graph of 4 TB in size.
> 4 TB is the whole traffic; each row represents a connection between two users,
> with lots of additional information:
> format1:
> usera, userb, additionalinfo1, additionalinfo2, additionalinfo3,
> additionalinfo4. ...
> I have converted this raw file into a more usable one; now each row
> corresponds to a user and the list of users he has connected to:
> format2:
> usera [userb,usere]
> userb [userc,userx,usery,usert]
> userc [userb]
> ..
> ..
> ..
> In order to get some numbers for a sizing decision, I converted one hour of
> data (3 GB) into format2; the resulting file is 155 MB.
> This one-hour data contains 3272300 rows (vertices with neighbour lists).
> Although the size of my data decreases dramatically, I couldn't figure out
> the size of the 4 TB data once converted to format2.
> The converted version of the 4 TB data will have approximately 15 million
> rows, but the rows will have bigger neighbour lists than in the one-hour
> example data.
> Say I have 15 million rows, each with approximately 50 users in its
> neighbour list:
> what is the approximate memory I will need on the whole cluster?
> Sorry for the lots of questions,
> best regards..
> On 20 February 2012 08:59, Avery Ching <> wrote:
>> Yes, you will need a lot of RAM until we get out-of-core partitions
>> and/or out-of-core messages.  Do you really need to load all 4 TB of data?
>>  The vertex index, vertex value, edge value, and message value objects all
>> take up space, as do the data structures that store them (hence your
>> estimates are definitely too low).  How big is the actual graph that you are
>> trying to analyze, in terms of vertices and edges?
>> Avery
>> On 2/19/12 10:45 PM, yavuz gokirmak wrote:
>>> Hi again,
>>> I am trying to estimate the minimum requirements for running graph
>>> analysis over my input data.
>>> In shortest path example it is said that
>>> "The first thing that happens is that getSplits() is called by the master
>>> and then the workers will process the InputSplit objects with the
>>> VertexReader to load their portion of the graph into memory"
>>> What I understood is that at any given time T, all graph nodes must be
>>> loaded into cluster memory.
>>> If I have 100 GB of graph data, will I need 25 machines with 4 GB of RAM
>>> each?
>>> If that is the case, I have a big memory problem analyzing 4 TB of data :)
>>> best regards.

   Claudio Martella
