Subject: Re: Estimating approximate hadoop cluster size
From: yavuz gokirmak <ygokirmak@gmail.com>
To: giraph-user@incubator.apache.org
Date: Mon, 20 Feb 2012 21:57:54 +0200

Thank you Claudio, all points are clear now.

Actually, in my case execution speed is not the main target; the network
analysis can run as a daily batch process.

You mentioned a MapReduce-based solution rather than the Giraph/Pregel
approach. I found a project named xrime, but its development seems to have
stalled. Do you know of any active graph-processing project that is based
on MapReduce rather than the Giraph approach?

On 20 February 2012 19:26, Claudio Martella wrote:

> As Avery put it, it's difficult to estimate the memory footprint of
> your graph. On one side you will probably have a smaller footprint,
> because the compact generic types used for your Vertex take less space
> than the data needed to persist them as Text on HDFS; e.g. it takes
> 4 bytes to store 10000000 as an int, but much more as a Unicode string
> in a file. On the other side, keeping a vertex in memory also means
> keeping the related data structures in memory, which is another cost
> that is hard to estimate.
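To make that size gap concrete, here is a rough plain-Java illustration;
the 16-byte object header and 8-byte reference figures are typical 64-bit
JVM assumptions, not Giraph internals.

    import java.nio.charset.StandardCharsets;

    public class FootprintSketch {
        public static void main(String[] args) {
            // The same vertex id persisted as text on HDFS vs. held as a primitive.
            String asText = "10000000";
            System.out.println("UTF-8 text bytes: "
                    + asText.getBytes(StandardCharsets.UTF_8).length); // 8
            System.out.println("primitive int:    " + (Integer.SIZE / 8)); // 4

            // In memory each neighbour id also pays for object plumbing.
            // Assumed 64-bit JVM costs: ~16-byte header per boxed Integer
            // plus an ~8-byte reference from the list that holds it.
            int neighbours = 50;
            long payload  = neighbours * 4L;          // the ids themselves
            long overhead = neighbours * (16L + 8L);  // headers + references (assumed)
            System.out.println("50 neighbours, ids only:        " + payload + " bytes");
            System.out.println("50 neighbours, boxed in a list: " + (payload + overhead) + " bytes");
        }
    }

So the serialized form can be smaller per value, while the in-memory form
pays several times the payload size in bookkeeping.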
> In general, I think it should be made quite clear that Giraph and
> Pregel were designed for scenarios where you can keep your graph and
> the messages it produces in memory. That is what makes them so fast.
> If your graph is much larger than your memory, you are better off just
> sticking to MapReduce, which is designed exactly for that case. After
> all, your computation will be dominated by disk I/O, so there is not
> much to gain from Giraph and Pregel, even once the out-of-core graph
> and message implementations are ready.
>
> Hope this helps,
>
> On Mon, Feb 20, 2012 at 10:25 AM, yavuz gokirmak wrote:
> > Yes, I don't need to load a graph of 4 TB in size.
> >
> > The 4 TB is the whole traffic; each row represents a connection
> > between two users, together with a lot of additional information:
> >
> > format1:
> > usera, userb, additionalinfo1, additionalinfo2, additionalinfo3,
> > additionalinfo4, ...
> >
> > I have converted this raw file into a more usable one, in which each
> > row corresponds to a user and the list of users he has connected to:
> >
> > format2:
> > usera [userb,usere]
> > userb [userc,userx,usery,usert]
> > userc [userb]
> > ..
> > ..
> > ..
> >
> > To get some numbers for the sizing decision, I converted one hour of
> > data (3 GB) into format2; the resulting file is 155 MB. That one hour
> > of data contains 3272300 rows (vertices with neighbour lists).
> > Although the size of my data decreases dramatically, I still couldn't
> > work out the size of the 4 TB data once converted to format2. The
> > converted version of the 4 TB data will have approximately 15 million
> > rows, but the rows will have bigger neighbour lists than in the
> > one-hour sample.
> >
> > Say I will have 15 million rows, each with approximately 50 users in
> > its neighbour list: what would be the approximate memory I need
> > across the whole cluster?
> >
> > Sorry for the many questions,
> >
> > best regards.
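For what it's worth, a back-of-envelope sketch of that estimate in plain
Java, along the lines Claudio and Avery describe; the per-vertex, per-edge,
and per-message byte costs and the worker count are assumptions, not
measured Giraph figures.

    public class ClusterSizingSketch {
        public static void main(String[] args) {
            long vertices = 15_000_000L;  // ~15 million rows in format2
            long degree   = 50;           // ~50 users per neighbour list

            // Assumed in-memory costs (object headers, references, boxed ids,
            // map/list entries). These are guesses, not measured Giraph figures.
            long bytesPerVertex = 200;
            long bytesPerEdge   = 50;
            long bytesPerMsg    = 30;     // roughly one message per edge per superstep

            long graphBytes   = vertices * bytesPerVertex + vertices * degree * bytesPerEdge;
            long messageBytes = vertices * degree * bytesPerMsg;

            double gib = (graphBytes + messageBytes) / (1024.0 * 1024 * 1024);
            System.out.printf("rough total:             %.1f GiB across the cluster%n", gib);
            System.out.printf("per worker (10 workers): %.1f GiB%n", gib / 10);
        }
    }

With those guesses, the graph plus one superstep's worth of messages comes
out around 60 GiB across the cluster, so the number to compare against is
the usable heap per worker multiplied by the number of workers.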
> >
> > On 20 February 2012 08:59, Avery Ching wrote:
> >>
> >> Yes, you will need a lot of RAM, until we get out-of-core partitions
> >> and/or out-of-core messages. Do you really need to load all 4 TB of
> >> data? The vertex index, vertex value, edge value, and message value
> >> objects all take up space, as do the data structures that store them
> >> (hence your estimates are definitely too low). How big is the actual
> >> graph that you are trying to analyze, in terms of vertices and edges?
> >>
> >> Avery
> >>
> >> On 2/19/12 10:45 PM, yavuz gokirmak wrote:
> >>>
> >>> Hi again,
> >>>
> >>> I am trying to estimate the minimum requirements for running graph
> >>> analysis over my input data.
> >>>
> >>> The shortest-path example says: "The first thing that happens is
> >>> that getSplits() is called by the master and then the workers will
> >>> process the InputSplit objects with the VertexReader to load their
> >>> portion of the graph into memory."
> >>>
> >>> What I understood is that at some point in time all graph vertices
> >>> must be loaded into cluster memory. If I have 100 GB of graph data,
> >>> will I need 25 machines with 4 GB of RAM each?
> >>>
> >>> If this is the case, I have a big memory problem analyzing 4 TB of
> >>> data :)
> >>>
> >>> best regards.
> >>
> >
>
> --
> Claudio Martella
> claudio.martella@gmail.com