giraph-user mailing list archives

From yavuz gokirmak <>
Subject Re: Estimating approximate hadoop cluster size
Date Mon, 20 Feb 2012 09:25:59 GMT
No, I don't need to load a graph of 4 TB in size.

4 TB is the whole traffic; each row represents a connection between two
users, with lots of additional information:
usera, userb, additionalinfo1, additionalinfo2, additionalinfo3,
additionalinfo4, ...

I have converted this raw file into a more usable one; now each row
corresponds to a user and the list of users he has connected to:
usera [userb,usere]
userb [userc,userx,usery,usert]
userc [userb]
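
The conversion above can be sketched as follows. This is only a minimal
single-machine illustration, not the job I actually ran: it assumes rows are
comma-separated, that a row "usera, userb, ..." means usera connected to
userb (one direction only), and that repeated connections are collapsed. The
names to_adjacency and format_row are made up for the example.

```python
from collections import defaultdict


def to_adjacency(rows):
    """Group raw connection rows into user -> list of connected users."""
    adjacency = defaultdict(list)  # keeps neighbours in first-seen order
    seen = defaultdict(set)        # per-user set for fast duplicate checks
    for row in rows:
        fields = [field.strip() for field in row.split(",")]
        src, dst = fields[0], fields[1]  # the additional info columns are dropped
        if dst not in seen[src]:         # ASSUMPTION: duplicates are collapsed
            seen[src].add(dst)
            adjacency[src].append(dst)
    return adjacency


def format_row(user, neighbours):
    """Render one output row, e.g. 'usera [userb,usere]'."""
    return f"{user} [{','.join(neighbours)}]"


raw_rows = [
    "usera, userb, info1, info2",
    "usera, usere, info1, info2",
    "userb, userc, info1, info2",
    "userb, userc, info1, info2",  # repeated connection
    "userc, userb, info1, info2",
]
converted = [format_row(u, n) for u, n in to_adjacency(raw_rows).items()]
for line in converted:
    print(line)
```

On the full data set the same grouping would have to run as a distributed
job keyed on the first user, since the raw file does not fit on one machine.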

To get some numbers for the sizing decision, I converted one hour of data
(3 GB) into the adjacency-list format above (format2); the resulting file
is 155 MB.
This one-hour data contains 3,272,300 rows (vertices with neighbour lists).
Although the size of my data decreases dramatically, I couldn't figure out
the size of the 4 TB data when converted to format2.
The converted version of the 4 TB data will have approximately 15 million
rows, but the rows will have bigger neighbour lists than in the one-hour
example data.
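
To make the sample numbers concrete, here is a small sketch of the
arithmetic. It assumes decimal units (1 GB = 10^9 bytes), which is my guess
about how the sizes were reported. The function estimate_converted_bytes and
its parameters bytes_per_id and row_overhead are hypothetical; they only
show why the row count alone is not enough to extrapolate the file size.

```python
# Figures from the one-hour sample (decimal units are an ASSUMPTION).
SAMPLE_RAW_BYTES = 3 * 10**9          # one hour of raw data
SAMPLE_CONVERTED_BYTES = 155 * 10**6  # same data in format2
SAMPLE_ROWS = 3_272_300               # rows in the format2 file

# Average size of one converted row and the overall size reduction.
bytes_per_row = SAMPLE_CONVERTED_BYTES / SAMPLE_ROWS
reduction_factor = SAMPLE_RAW_BYTES / SAMPLE_CONVERTED_BYTES


def estimate_converted_bytes(rows, avg_neighbours, bytes_per_id, row_overhead):
    """Text size of a format2 file: each row holds one user id plus its neighbours.

    bytes_per_id and row_overhead (brackets, separators, newline) are
    hypothetical inputs that would have to be measured on the real data.
    """
    return rows * (row_overhead + bytes_per_id * (1 + avg_neighbours))


print(f"average converted row: {bytes_per_row:.1f} bytes")
print(f"raw / converted size:  {reduction_factor:.1f}x")
```

Because the size grows with avg_neighbours as well as with the number of
rows, the per-row average from the one-hour sample would underestimate the
full data set.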

Say I will have 15 million rows, each with approximately 50 users in its
neighbour list:
what will be the approximate memory I need across the whole cluster?
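
The question can be framed as a back-of-envelope calculation like the one
below. It is a sketch under loud assumptions: the per-vertex and per-edge
byte costs are placeholders, not measured Giraph numbers, and they would
need to be replaced with values observed on a real run. Message memory is
not included at all. The function names are made up for the example.

```python
import math


def estimate_graph_memory_gib(num_vertices, avg_degree,
                              bytes_per_vertex, bytes_per_edge):
    """Rough in-memory graph size in GiB from assumed per-object costs."""
    num_edges = num_vertices * avg_degree
    total_bytes = num_vertices * bytes_per_vertex + num_edges * bytes_per_edge
    return total_bytes / 1024**3


def machines_needed(total_gib, ram_per_machine_gib, usable_fraction):
    """Workers required if only usable_fraction of each machine's RAM holds graph data."""
    return math.ceil(total_gib / (ram_per_machine_gib * usable_fraction))


NUM_VERTICES = 15_000_000  # rows expected after converting the 4 TB
AVG_DEGREE = 50            # approximate neighbours per row

# ASSUMPTION: placeholder (bytes_per_vertex, bytes_per_edge) pairs,
# chosen only to show how sensitive the total is to object overhead.
for bytes_per_vertex, bytes_per_edge in [(100, 16), (200, 50), (500, 100)]:
    gib = estimate_graph_memory_gib(NUM_VERTICES, AVG_DEGREE,
                                    bytes_per_vertex, bytes_per_edge)
    # ASSUMPTION: 4 GiB machines with half of the RAM usable for the graph.
    workers = machines_needed(gib, ram_per_machine_gib=4, usable_fraction=0.5)
    print(f"{bytes_per_vertex:>4} B/vertex, {bytes_per_edge:>4} B/edge: "
          f"{gib:6.1f} GiB, {workers} machines")
```

With 750 million edges, the per-edge cost dominates the total, so that is
the number worth measuring first.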

Sorry for so many questions,

best regards..

On 20 February 2012 08:59, Avery Ching <> wrote:

> Yes, you will need a lot of ram, until we get out-of-core partitions
> and/or out-of-core messages.  Do you really need to load all 4 TB of data?
>  The vertex index, vertex value, edge value, and message value objects all
> take up space as well as the data structures to store them (hence your
> estimates are definitely too low).  How big is the actual graph that you
> are trying to analyze in terms of vertices and edges?
> Avery
> On 2/19/12 10:45 PM, yavuz gokirmak wrote:
>> Hi again,
>> I am trying to estimate the minimum requirements for running graph
>> analysis over my input data.
>> In the shortest path example it is said that
>> "The first thing that happens is that getSplits() is called by the master
>> and then the workers will process the InputSplit objects with the
>> VertexReader to load their portion of the graph into memory"
>> What I understood is that at some time T all graph nodes must be loaded
>> into cluster memory.
>> If I have 100 GB of graph data, will I need 25 machines with 4 GB of RAM
>> each?
>> If this is the case, I have a big memory problem analyzing 4 TB of data :)
>> best regards.
