incubator-giraph-user mailing list archives

From Claudio Martella <claudio.marte...@gmail.com>
Subject Re: Estimating approximate hadoop cluster size
Date Mon, 20 Feb 2012 20:02:09 GMT
It really boils down to what you need to do. There are libraries out
there that each solve one particular problem: PageRank? Single-source
shortest paths? Diameter? Connected components?

You can check Pegasus (http://www.cs.cmu.edu/~pegasus/), or more generally
read the Pegasus paper and try to apply their model to your own analytics.
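
(For illustration only: the sketch below shows one iteration of the usual
"propagate the minimum component id to your neighbours" trick, which is
roughly what Pegasus' connected-components algorithm (HCC) does through its
GIM-V abstraction, written as a plain Hadoop MapReduce job. The class names
and the tab-separated "nodeId <TAB> label <TAB> n1,n2,..." input layout are
assumptions for this sketch, not Pegasus code; the driver that re-runs the
job until no label changes is omitted.)

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One iteration of min-label propagation for connected components.
// Input/output lines: nodeId <TAB> currentLabel <TAB> n1,n2,n3
public class MinLabelIteration {

  public static class PropagateMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      String node = parts[0];
      String label = parts[1];
      String neighbours = parts.length > 2 ? parts[2] : "";
      // Re-emit the adjacency list so the next iteration still has the graph.
      ctx.write(new Text(node), new Text("A" + neighbours));
      // Send the current label to yourself and to every neighbour.
      ctx.write(new Text(node), new Text("L" + label));
      for (String n : neighbours.split(",")) {
        if (!n.isEmpty()) {
          ctx.write(new Text(n), new Text("L" + label));
        }
      }
    }
  }

  public static class MinLabelReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text node, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String neighbours = "";
      long min = Long.MAX_VALUE;
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("A")) {
          neighbours = s.substring(1);
        } else {
          min = Math.min(min, Long.parseLong(s.substring(1)));
        }
      }
      // Same layout as the input, so a driver can chain iterations
      // until no label changes any more.
      ctx.write(node, new Text(min + "\t" + neighbours));
    }
  }
}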

On Mon, Feb 20, 2012 at 8:57 PM, yavuz gokirmak <ygokirmak@gmail.com> wrote:
> Thank you Claudio, all points are clear now.
>
> Actually, in my case, execution speed is not the main target;
> my network analysis can run as a batch process on a daily basis.
>
> You mentioned a MapReduce-based solution rather than the Giraph/Pregel
> approach. I found a project named XRime, but its development seems to have
> halted. Do you know of any active graph-processing project based on
> MapReduce rather than the Giraph approach?
>
>
>
>
> On 20 February 2012 19:26, Claudio Martella <claudio.martella@gmail.com>
> wrote:
>>
>> As Avery put it, it's difficult to estimate the memory footprint of
>> your graph. On one side, you will probably have a smaller memory footprint
>> thanks to the compact generic types used for your Vertex, compared to the
>> amount of data required to persist them as Text on HDFS. E.g. it takes
>> 4 bytes to store 10000000 as an int but considerably more as a Unicode
>> string in a file. On the other side, keeping a vertex in memory also means
>> keeping the related data structures in memory, whose overhead is yet
>> another thing to estimate.
>> In general, I think it should be made quite clear that Giraph and
>> Pregel were designed for scenarios where you can keep your graph and
>> the messages produced in memory. That's what makes them so fast. If
>> your graph is much bigger than your memory, you'd better just stick to
>> MapReduce, which is designed exactly for that case. After all, your
>> computation will be dominated by disk I/O, so there isn't much advantage
>> to gain from Giraph and Pregel, even once the out-of-core graph and
>> message implementations are ready.
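
(A tiny sketch, just to make the binary-versus-text point above concrete;
the data-structure overhead mentioned in the message is exactly the part
this does not capture and that is hard to predict:)

public class IntVsTextSize {
  public static void main(String[] args) {
    int id = 10000000;
    int binaryBytes = 4;                                     // an int is always 4 bytes
    int textBytes = Integer.toString(id).getBytes().length;  // "10000000" -> 8 bytes of text
    System.out.println("binary: " + binaryBytes + " B, text: " + textBytes + " B");
    // On top of the raw value, an in-memory vertex also pays for its value,
    // its edge list and the surrounding data structures (object headers,
    // references, map entries, ...), which is the hard-to-estimate part.
  }
}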
>>
>> Hope this helps,
>>
>> On Mon, Feb 20, 2012 at 10:25 AM, yavuz gokirmak <ygokirmak@gmail.com>
>> wrote:
>> > Yes, I don't need to load a graph of 4 TB size.
>> >
>> >
>> > 4 TB is the whole traffic; each row represents a connection between two
>> > users, with lots of additional information:
>> > format1:
>> > usera, userb, additionalinfo1, additionalinfo2, additionalinfo3,
>> > additionalinfo4. ...
>> >
>> > I have converted this raw file into a more usable one; now each row
>> > corresponds to a user and the list of users he has connected to:
>> > format2:
>> > usera [userb,usere]
>> > userb [userc,userx,usery,usert]
>> > userc [userb]
>> > ..
>> > ..
>> > ..
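
(As an aside, a minimal sketch of how such a format1 -> format2 conversion
could be expressed as a MapReduce job; the class names are made up and the
field layout is assumed from the examples above, keeping only the first two
columns:)

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AdjacencyListBuilder {

  public static class EdgeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // format1: usera, userb, additionalinfo1, ... -> keep only the endpoints.
      String[] f = line.toString().split(",");
      if (f.length >= 2) {
        ctx.write(new Text(f[0].trim()), new Text(f[1].trim()));
        // For an undirected graph, also emit the reverse edge here.
      }
    }
  }

  public static class NeighbourReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text user, Iterable<Text> neighbours, Context ctx)
        throws IOException, InterruptedException {
      // Deduplicate repeated connections between the same pair of users.
      Set<String> unique = new LinkedHashSet<String>();
      for (Text n : neighbours) {
        unique.add(n.toString());
      }
      StringBuilder list = new StringBuilder("[");
      for (String n : unique) {
        if (list.length() > 1) {
          list.append(',');
        }
        list.append(n);
      }
      list.append(']');
      // format2: usera  [userb,usere]
      ctx.write(user, new Text(list.toString()));
    }
  }
}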
>> >
>> > To get some numbers for the sizing decision, I converted one hour of
>> > data (3 GB) into format2; the resulting file is 155 MB.
>> > This one-hour data contains 3272300 rows (vertices with neighbour
>> > lists).
>> > Although the size of my data decreases dramatically, I couldn't figure
>> > out the size of the 4 TB data once converted to format2.
>> > The converted version of the 4 TB data will have approximately 15 million
>> > rows, but the rows will have bigger neighbour lists than in the one-hour
>> > example data.
>> >
>> > Say I will have 15 million rows, each with approximately 50 users in
>> > its neighbour list:
>> > what will be the approximate memory I need on the whole cluster?
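
(A very rough back-of-envelope for those numbers; the per-vertex and
per-edge byte costs below are pure assumptions chosen to show the shape of
the estimate, not measured Giraph figures, and message buffers and JVM
overhead are ignored:)

public class ClusterMemoryBallpark {
  public static void main(String[] args) {
    long vertices       = 15000000L;  // ~15 million rows, from the figures above
    long edgesPerVertex = 50;         // ~50 neighbours per row, from the figures above
    long bytesPerVertex = 100;        // id + value + bookkeeping (assumed)
    long bytesPerEdge   = 30;         // target id + edge value + overhead (assumed)

    long graphBytes = vertices * (bytesPerVertex + edgesPerVertex * bytesPerEdge);
    System.out.printf("graph alone: ~%.1f GiB%n",
        graphBytes / (1024.0 * 1024.0 * 1024.0));
    // Messages also have to fit: if every vertex sends one small message to
    // each neighbour per superstep, that can add the same order of magnitude.
  }
}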
>> >
>> > Sorry for all the questions,
>> >
>> > best regards..
>> >
>> >
>> >
>> >
>> >
>> >
>> > On 20 February 2012 08:59, Avery Ching <aching@apache.org> wrote:
>> >>
>> >> Yes, you will need a lot of RAM until we get out-of-core partitions
>> >> and/or out-of-core messages.  Do you really need to load all 4 TB of
>> >> data?  The vertex index, vertex value, edge value, and message value
>> >> objects all take up space, as do the data structures that store them
>> >> (hence your estimates are definitely too low).  How big is the actual
>> >> graph that you are trying to analyze, in terms of vertices and edges?
>> >>
>> >> Avery
>> >>
>> >>
>> >> On 2/19/12 10:45 PM, yavuz gokirmak wrote:
>> >>>
>> >>> Hi again,
>> >>>
>> >>> I am trying to estimate the minimum requirements for running graph
>> >>> analysis over my input data.
>> >>>
>> >>> In the shortest paths example it is said that
>> >>> "The first thing that happens is that getSplits() is called by the
>> >>> master and then the workers will process the InputSplit objects with
>> >>> the VertexReader to load their portion of the graph into memory"
>> >>>
>> >>> What I understood is that at any given time T, the whole set of graph
>> >>> nodes must be loaded into cluster memory.
>> >>> If I have 100 GB of graph data, will I need 25 machines with 4 GB of
>> >>> RAM each?
>> >>>
>> >>> If this is the case, I have a big memory problem analyzing 4 TB of
>> >>> data :)
>> >>>
>> >>> best regards.
>> >>
>> >>
>> >
>>
>>
>>
>> --
>>    Claudio Martella
>>    claudio.martella@gmail.com
>
>



-- 
   Claudio Martella
   claudio.martella@gmail.com
