From: Claudio Martella <claudio.martella@gmail.com>
Date: Tue, 21 May 2013 14:33:48 +0200
Subject: Re: What if the resulting graph is larger than the memory?
To: user@giraph.apache.org

Let me understand. You said that your graph is about 400GB, and that you
don't have 400GB of main memory in your cluster (assuming for now that
400GB of edge-based input would actually result in that amount of memory
used on the heap, which is not the case). If THIS is your problem, and not,
for example, that the MESSAGES you create would exceed your available
memory, or that your graph is going to grow even more during the
computation (because you use the mutation API), then the out-of-core graph
should be the way to go for you. It is very simple: you specify how many
partitions a worker keeps in memory, and Giraph will keep only that number
of partitions in memory. The rest will be stored on disk and loaded into
memory only when computed (causing one of the partitions currently kept in
memory to be spilled to disk).

The options are giraph.useOutOfCore=true and giraph.maxPartitionsInMemory
(default 10) to control the number of partitions.

On Tue, May 21, 2013 at 2:18 PM, Sebastian Schelter wrote:

> It simply means that not all partitions of the graph are in-memory all
> the time.
> If you don't have enough memory, some of them might get spilled
> to disk.
>
> On 21.05.2013 14:16, Han JU wrote:
> > Thanks, that's a good point.
> > But for the moment I just want to try out different solutions on Hadoop
> > and have a comparison of them. So I'd like to see how they perform
> > under general conditions.
> >
> > Do you happen to know what out-of-core graph means?
> >
> > Thanks.
> >
> > 2013/5/21 Sebastian Schelter
> >
> >> Ah, I see. I have worked on similar things in recommender systems.
> >> Here the problem is generally that you get a result quadratic in the
> >> number of interactions per item. If you have some topsellers in your
> >> data, those might make up most of the large result. It helps very
> >> much to throw out the few most popular items (if your application
> >> allows that).
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 21.05.2013 12:10, Han JU wrote:
> >>> Hi Sebastian,
> >>>
> >>> It's something like frequent item pairs out of transaction data.
> >>> I need all these pairs with a rather low support (say 2), so the
> >>> result could be very big.
> >>>
> >>> 2013/5/21 Sebastian Schelter
> >>>
> >>>> Hello Han,
> >>>>
> >>>> out of curiosity, what do you compute that has such a big result?
> >>>>
> >>>> Best,
> >>>> Sebastian
> >>>>
> >>>> On 21.05.2013 11:52, Han JU wrote:
> >>>>> Hi Maja,
> >>>>>
> >>>>> The input graph of my problem is not big; the calculation result
> >>>>> is very big.
> >>>>> In fact, what does out-of-core graph mean? Where can I find some
> >>>>> examples of this, and of output during computation?
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>> 2013/5/17 Maja Kabiljo
> >>>>>
> >>>>>> Hi JU,
> >>>>>>
> >>>>>> One thing you can try is to use the out-of-core graph
> >>>>>> (giraph.useOutOfCoreGraph option).
> >>>>>>
> >>>>>> I don't know what your exact use case is – do you have a graph
> >>>>>> which is huge, or is it the data which you calculate in your
> >>>>>> application? In the second case, there is the
> >>>>>> 'giraph.doOutputDuringComputation' option you might want to try
> >>>>>> out. When that is turned on, during each superstep writeVertex
> >>>>>> will be called immediately after compute is called for that
> >>>>>> vertex. This means that you can store the data you want to write
> >>>>>> in the vertex, write it, and clear the data before going to the
> >>>>>> next vertex.
> >>>>>>
> >>>>>> Maja
> >>>>>>
> >>>>>> From: Han JU
> >>>>>> Reply-To: "user@giraph.apache.org"
> >>>>>> Date: Friday, May 17, 2013 8:38 AM
> >>>>>> To: "user@giraph.apache.org"
> >>>>>> Subject: What if the resulting graph is larger than the memory?
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> It's me again.
> >>>>>> After a day's work I've coded a Giraph solution for my problem
> >>>>>> at hand. I gave it a run on a medium dataset and it's notably
> >>>>>> faster than other approaches.
> >>>>>>
> >>>>>> However, the goal is to process larger inputs. For example, I
> >>>>>> have a larger dataset whose result graph is about 400GB when
> >>>>>> represented in edge format in a text file. And I think the edges
> >>>>>> that the algorithm creates all reside in the cluster's memory.
> >>>>>> So does it mean that for this big dataset I need a cluster with
> >>>>>> ~400GB of main memory? Is there any possibility that I can
> >>>>>> output "on the go", meaning I don't need to construct the whole
> >>>>>> graph: an edge is output to HDFS immediately instead of being
> >>>>>> created in main memory and then output?
> >>>>>>
> >>>>>> Thanks!
> >>>>>> --
> >>>>>> *JU Han*
> >>>>>>
> >>>>>> Software Engineer Intern @ KXEN Inc.
> >>>>>> UTC - Université de Technologie de Compiègne
> >>>>>> *GI06 - Fouille de Données et Décisionnel*
> >>>>>>
> >>>>>> +33 0619608888

--
Claudio Martella
claudio.martella@gmail.com
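[Editor's note: the options discussed above are passed to a Giraph job as custom arguments. A minimal launch sketch follows; the jar name, computation class, input/output formats, and paths are placeholders, and only the two `-ca` options come from the thread. Note the thread mentions two spellings (Claudio's giraph.useOutOfCore, Maja's giraph.useOutOfCoreGraph); the exact property name depends on your Giraph version, so check the options reference for the release you run.]

```shell
# Hypothetical launch of a Giraph job with out-of-core graph enabled.
# Jar, class, formats, and HDFS paths below are placeholders.
hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
    com.example.MyComputation \
    -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
    -vip /user/me/input \
    -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /user/me/output \
    -w 4 \
    -ca giraph.useOutOfCore=true \
    -ca giraph.maxPartitionsInMemory=10
```

With these two arguments, each worker holds at most 10 partitions on the heap and spills the rest to local disk, as described above.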
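[Editor's note: Maja's suggestion for the case where the *result* (not the input graph) is too large is likewise a custom argument. A sketch of the relevant fragment, to be added to a job launch command; the flag name is as quoted in the thread, the semantics in the comments are her description.]

```shell
# Write vertices out during each superstep instead of holding the whole
# result until the end of the job: writeVertex is called immediately
# after compute for each vertex, so the vertex can buffer its output
# data, have it written, and clear it before the next vertex is processed.
-ca giraph.doOutputDuringComputation=true
```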