incubator-giraph-user mailing list archives

From yavuz gokirmak <ygokir...@gmail.com>
Subject Re: Estimating approximate hadoop cluster size
Date Mon, 20 Feb 2012 19:57:54 GMT
Thank you Claudio, all points are clear now.

Actually, in my case execution speed is not the main target;
my network analysis can run as a daily batch process.

You mentioned a MapReduce-based solution rather than the Giraph/Pregel
approach. I found a project named xrime, but its development seems to
have halted. Do you know of any active graph-processing project based on
MapReduce rather than the Giraph approach?
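
To make the question a bit more concrete: the kind of per-iteration job
I have in mind would be roughly the sketch below. The class names are
hypothetical and I added a distance column to format2 purely for the
example; it is one BFS-style "superstep" expressed as a plain Hadoop
MapReduce pass, re-run until no distance changes.

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Input/output rows: user<TAB>comma-separated-neighbours<TAB>distance
  // (Long.MAX_VALUE means "not reached yet").
  public class BfsStep {

    public static class StepMapper
        extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable offset, Text row, Context ctx)
          throws IOException, InterruptedException {
        String[] parts = row.toString().split("\t");
        String user = parts[0];
        String neighbours = parts[1];
        long dist = Long.parseLong(parts[2]);
        // re-emit the graph structure so the reducer can rebuild the row
        ctx.write(new Text(user), new Text("ADJ\t" + neighbours + "\t" + dist));
        if (dist != Long.MAX_VALUE) {
          for (String n : neighbours.split(",")) {
            // the "message" to each neighbour: a candidate distance
            ctx.write(new Text(n), new Text("DIST\t" + (dist + 1)));
          }
        }
      }
    }

    public static class StepReducer
        extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text user, Iterable<Text> values, Context ctx)
          throws IOException, InterruptedException {
        String neighbours = "";
        long best = Long.MAX_VALUE;
        for (Text v : values) {
          String[] parts = v.toString().split("\t");
          if (parts[0].equals("ADJ")) {
            neighbours = parts[1];
            best = Math.min(best, Long.parseLong(parts[2]));
          } else {
            best = Math.min(best, Long.parseLong(parts[1]));
          }
        }
        // same layout as the input, ready to be fed into the next pass
        ctx.write(user, new Text(neighbours + "\t" + best));
      }
    }
  }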



On 20 February 2012 19:26, Claudio Martella <claudio.martella@gmail.com> wrote:

> As Avery put it, it's difficult to estimate the memory footprint of
> your graph. On one side you will probably have a smaller memory
> footprint thanks to the compact generic types used for your Vertex,
> compared to the amount of data required to persist them as Text on
> HDFS. I.e. it takes 4 bytes to store 10000000 as an int, but much more
> as a unicode string in a file. On the other side, keeping a vertex in
> memory also means keeping the related data structures in memory, which
> is another thing that is hard to estimate.
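>
> Just to give a feeling for the order of magnitude anyway, here is a very
> rough back-of-envelope sketch. The per-object overheads are typical
> 64-bit JVM numbers, not something measured on your data, and I plugged
> in the 15 million vertices with ~50 neighbours each that you mention
> below:
>
>   public class MemoryEnvelope {
>     public static void main(String[] args) {
>       long vertices = 15000000L;     // rows in format2
>       long edgesPerVertex = 50;      // average neighbour-list size
>
>       long bytesPerId = 8;           // e.g. a numeric (long) vertex id
>       long objectOverhead = 16;      // typical object header, 64-bit JVM
>       long perEdge = bytesPerId + objectOverhead;    // target id + wrapper
>       long perVertex = objectOverhead + bytesPerId   // the vertex itself
>           + edgesPerVertex * perEdge                 // out-edges
>           + 64;                                      // list/map bookkeeping (a guess)
>
>       double gb = vertices * perVertex / (1024.0 * 1024 * 1024);
>       System.out.printf("~%.1f GB before messages and GC headroom%n", gb);
>     }
>   }
>
> For those inputs it prints something around 18 GB, but treat that as a
> lower bound: message queues, the Java collections holding the vertices
> and GC headroom will push the real number higher.
>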
> In general, I think it should be made quite clear that Giraph and
> Pregel were designed for scenarios where you can keep your graph and
> the messages produced in memory. That's what makes them so fast. If
> your graph is >> your memory, you had better just stick to MapReduce,
> which is designed exactly for that case. After all, your computation
> will be dominated by disk I/O, so there is not much advantage to be
> gained from Giraph and Pregel, even once the out-of-core graph and
> message implementations are ready.
>
> Hope this helps,
>
> On Mon, Feb 20, 2012 at 10:25 AM, yavuz gokirmak <ygokirmak@gmail.com>
> wrote:
> > Yes, I don't need to load a graph of 4 TB in size,
> >
> >
> > 4 TB is the whole traffic; each row represents a connection between
> > two users, with lots of additional information:
> > format1:
> > usera, userb, additionalinfo1, additionalinfo2, additionalinfo3,
> > additionalinfo4, ...
> >
> > I have converted this raw file into a more usable one; now each row
> > corresponds to a user and the list of users he has connected to:
> > format2:
> > usera [userb,usere]
> > userb [userc,userx,usery,usert]
> > userc [userb]
> > ..
> > ..
> > ..
> >
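> > For reference, a conversion like this is basically a single MapReduce
> > pass; a simplified sketch (hypothetical class names, no de-duplication
> > of neighbours, additional columns ignored):
> >
> >   import java.io.IOException;
> >
> >   import org.apache.hadoop.io.LongWritable;
> >   import org.apache.hadoop.io.Text;
> >   import org.apache.hadoop.mapreduce.Mapper;
> >   import org.apache.hadoop.mapreduce.Reducer;
> >
> >   public class Format1ToFormat2 {
> >
> >     // format1 row: "usera, userb, additionalinfo1, additionalinfo2, ..."
> >     public static class EdgeMapper
> >         extends Mapper<LongWritable, Text, Text, Text> {
> >       @Override
> >       protected void map(LongWritable offset, Text row, Context ctx)
> >           throws IOException, InterruptedException {
> >         String[] cols = row.toString().split(",\\s*");
> >         // one (source, target) pair per connection record
> >         ctx.write(new Text(cols[0]), new Text(cols[1]));
> >       }
> >     }
> >
> >     // collects all targets of a user into "usera [userb,usere]"
> >     public static class NeighbourListReducer
> >         extends Reducer<Text, Text, Text, Text> {
> >       @Override
> >       protected void reduce(Text user, Iterable<Text> targets, Context ctx)
> >           throws IOException, InterruptedException {
> >         StringBuilder list = new StringBuilder("[");
> >         for (Text t : targets) {
> >           if (list.length() > 1) list.append(',');
> >           list.append(t.toString());
> >         }
> >         ctx.write(user, new Text(list.append(']').toString()));
> >       }
> >     }
> >   }
> >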
> > In order to get some numbers for the sizing decision, I converted one
> > hour of data (3 GB) into format2; the resulting file is 155 MB.
> > This one-hour data contains 3272300 rows (vertices with neighbour
> > lists). Although the size of my data decreases dramatically, I couldn't
> > figure out the size of the 4 TB data once converted to format2.
> > The converted version of the 4 TB data will have approximately 15
> > million rows, but the rows will have bigger neighbour lists than in the
> > one-hour example data.
> >
> > Say I will have 15 million rows, each with approximately 50 users in
> > its neighbour list: what will be the approximate memory I need on the
> > whole cluster?
> >
> > Sorry for all the questions,
> >
> > best regards..
> >
> >
> >
> >
> >
> >
> > On 20 February 2012 08:59, Avery Ching <aching@apache.org> wrote:
> >>
> >> Yes, you will need a lot of RAM until we get out-of-core partitions
> >> and/or out-of-core messages.  Do you really need to load all 4 TB of
> >> data?  The vertex index, vertex value, edge value, and message value
> >> objects all take up space, as well as the data structures to store
> >> them (hence your estimates are definitely too low).  How big is the
> >> actual graph that you are trying to analyze, in terms of vertices and
> >> edges?
> >>
> >> Avery
> >>
> >>
> >> On 2/19/12 10:45 PM, yavuz gokirmak wrote:
> >>>
> >>> Hi again,
> >>>
> >>> I am trying to estimate the minimum requirements to run graph
> >>> analysis over my input data.
> >>>
> >>> In the shortest paths example it is said that
> >>> "The first thing that happens is that getSplits() is called by the
> >>> master and then the workers will process the InputSplit objects with
> >>> the VertexReader to load their portion of the graph into memory"
> >>>
> >>> What I understood is that at a given time T all graph nodes must be
> >>> loaded into the cluster's memory.
> >>> If I have 100 GB of graph data, will I need 25 machines with 4 GB of
> >>> RAM each?
> >>>
> >>> If this is the case, I have a big memory problem to analyze the 4 TB
> >>> data :)
> >>>
> >>> best regards.
> >>
> >>
> >
>
>
>
> --
>    Claudio Martella
>    claudio.martella@gmail.com
>
