Subject: Re: Estimating approximate hadoop cluster size
From: yavuz gokirmak <ygokirmak@gmail.com>
To: giraph-user@incubator.apache.org
Date: Mon, 20 Feb 2012 21:57:54 +0200

Thank you Claudio, all points are clear now.

Actually, in my case execution speed is not the main target; the network
analysis can run as a daily batch process.

You mentioned a MapReduce-based solution rather than the Giraph/Pregel
approach. I found a project named xrime, but its development seems to have
stalled. Do you know of any active graph-processing project that is based
on MapReduce rather than the Giraph approach?

On 20 February 2012 19:26, Claudio Martella wrote:

> As Avery put it, it's difficult to estimate the memory footprint of
> your graph. On one side you will probably have a smaller footprint,
> because the compact generic types used for your Vertex take less space
> than the data needed to persist them as Text on HDFS; e.g. it takes
> 4 bytes to store 10000000 as an int, but much more as a Unicode string
> in a file. On the other side, keeping a vertex in memory also means
> keeping the related data structures in memory, which is another cost
> that is hard to estimate.
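To make that size gap concrete, here is a rough plain-Java illustration;
the 16-byte object header and 8-byte reference figures are typical 64-bit
JVM assumptions, not Giraph internals.

    import java.nio.charset.StandardCharsets;

    public class FootprintSketch {
        public static void main(String[] args) {
            // The same vertex id persisted as text on HDFS vs. held as a primitive.
            String asText = "10000000";
            System.out.println("UTF-8 text bytes: "
                    + asText.getBytes(StandardCharsets.UTF_8).length); // 8
            System.out.println("primitive int:    " + (Integer.SIZE / 8)); // 4

            // In memory each neighbour id also pays for object plumbing.
            // Assumed 64-bit JVM costs: ~16-byte header per boxed Integer
            // plus an ~8-byte reference from the list that holds it.
            int neighbours = 50;
            long payload  = neighbours * 4L;          // the ids themselves
            long overhead = neighbours * (16L + 8L);  // headers + references (assumed)
            System.out.println("50 neighbours, ids only:        " + payload + " bytes");
            System.out.println("50 neighbours, boxed in a list: " + (payload + overhead) + " bytes");
        }
    }

So the serialized form can be smaller per value, while the in-memory form
pays several times the payload size in bookkeeping.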
> In general, I think it should be made quite clear that Giraph and
> Pregel were designed for scenarios where you can keep your graph and
> the messages it produces in memory. That is what makes them so fast.
> If your graph is much larger than your memory, you are better off just
> sticking to MapReduce, which is designed exactly for that case. After
> all, your computation will be dominated by disk I/O, so there is not
> much to gain from Giraph and Pregel, even once the out-of-core graph
> and message implementations are ready.
>
> Hope this helps,
>
> On Mon, Feb 20, 2012 at 10:25 AM, yavuz gokirmak wrote:
> > Yes, I don't need to load a graph of 4 TB in size.
> >
> > The 4 TB is the whole traffic; each row represents a connection
> > between two users, together with a lot of additional information:
> >
> > format1:
> > usera, userb, additionalinfo1, additionalinfo2, additionalinfo3,
> > additionalinfo4, ...
> >
> > I have converted this raw file into a more usable one, in which each
> > row corresponds to a user and the list of users he has connected to:
> >
> > format2:
> > usera [userb,usere]
> > userb [userc,userx,usery,usert]
> > userc [userb]
> > ..
> > ..
> > ..
> >
> > To get some numbers for the sizing decision, I converted one hour of
> > data (3 GB) into format2; the resulting file is 155 MB. That one hour
> > of data contains 3272300 rows (vertices with neighbour lists).
> > Although the size of my data decreases dramatically, I still couldn't
> > work out the size of the 4 TB data once converted to format2. The
> > converted version of the 4 TB data will have approximately 15 million
> > rows, but the rows will have bigger neighbour lists than in the
> > one-hour sample.
> >
> > Say I will have 15 million rows, each with approximately 50 users in
> > its neighbour list: what would be the approximate memory I need
> > across the whole cluster?
> >
> > Sorry for the many questions,
> >
> > best regards.
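For what it's worth, a back-of-envelope sketch of that estimate in plain
Java, along the lines Claudio and Avery describe; the per-vertex, per-edge,
and per-message byte costs and the worker count are assumptions, not
measured Giraph figures.

    public class ClusterSizingSketch {
        public static void main(String[] args) {
            long vertices = 15_000_000L;  // ~15 million rows in format2
            long degree   = 50;           // ~50 users per neighbour list

            // Assumed in-memory costs (object headers, references, boxed ids,
            // map/list entries). These are guesses, not measured Giraph figures.
            long bytesPerVertex = 200;
            long bytesPerEdge   = 50;
            long bytesPerMsg    = 30;     // roughly one message per edge per superstep

            long graphBytes   = vertices * bytesPerVertex + vertices * degree * bytesPerEdge;
            long messageBytes = vertices * degree * bytesPerMsg;

            double gib = (graphBytes + messageBytes) / (1024.0 * 1024 * 1024);
            System.out.printf("rough total:             %.1f GiB across the cluster%n", gib);
            System.out.printf("per worker (10 workers): %.1f GiB%n", gib / 10);
        }
    }

With those guesses, the graph plus one superstep's worth of messages comes
out around 60 GiB across the cluster, so the number to compare against is
the usable heap per worker multiplied by the number of workers.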
> >
> > On 20 February 2012 08:59, Avery Ching wrote:
> >>
> >> Yes, you will need a lot of RAM, until we get out-of-core partitions
> >> and/or out-of-core messages. Do you really need to load all 4 TB of
> >> data? The vertex index, vertex value, edge value, and message value
> >> objects all take up space, as do the data structures that store them
> >> (hence your estimates are definitely too low). How big is the actual
> >> graph that you are trying to analyze, in terms of vertices and edges?
> >>
> >> Avery
> >>
> >> On 2/19/12 10:45 PM, yavuz gokirmak wrote:
> >>>
> >>> Hi again,
> >>>
> >>> I am trying to estimate the minimum requirements for running graph
> >>> analysis over my input data.
> >>>
> >>> The shortest-path example says: "The first thing that happens is
> >>> that getSplits() is called by the master and then the workers will
> >>> process the InputSplit objects with the VertexReader to load their
> >>> portion of the graph into memory."
> >>>
> >>> What I understood is that at some point in time all graph vertices
> >>> must be loaded into cluster memory. If I have 100 GB of graph data,
> >>> will I need 25 machines with 4 GB of RAM each?
> >>>
> >>> If this is the case, I have a big memory problem analyzing 4 TB of
> >>> data :)
> >>>
> >>> best regards.
> >>
> >
>
> --
> Claudio Martella
> claudio.martella@gmail.com