From: Claudio Martella <claudio.martella@gmail.com>
Date: Tue, 21 May 2013 14:33:48 +0200
Subject: Re: What if the resulting graph is larger than the memory?
To: user@giraph.apache.org

Let me understand. You said that your graph is about 400GB, and that you
don't have 400GB of main memory in your cluster (assuming for now that
400GB of edge-based input would actually result in that amount of memory
used on the heap, which is not the case). If THIS is your problem, and not,
for example, that the MESSAGES you create would exceed your available
memory, or that your graph is going to grow even more during the
computation (because you use the mutation API), then the out-of-core graph
should be the way to go for you. It is very simple: you specify how many
partitions a worker keeps in memory, and Giraph will keep only that number
of partitions in memory. The rest will be stored on disk and loaded into
memory only when computed (causing one of the partitions currently kept in
memory to be spilled to disk).

The options are giraph.useOutOfCore=true and giraph.maxPartitionsInMemory
(default 10) to control the number of partitions.

On Tue, May 21, 2013 at 2:18 PM, Sebastian Schelter wrote:

> It simply means that not all partitions of the graph are in-memory all
> the time.
> If you don't have enough memory, some of them might get spilled
> to disk.
>
> On 21.05.2013 14:16, Han JU wrote:
> > Thanks, that's a good point.
> > But for the moment I just want to try out different solutions on Hadoop
> > and have a comparison of them. So I'd like to see how they perform
> > under general conditions.
> >
> > Do you happen to know what out-of-core graph means?
> >
> > Thanks.
> >
> > 2013/5/21 Sebastian Schelter
> >
> >> Ah, I see. I have worked on similar things in recommender systems.
> >> Here the problem is generally that you get a result quadratic in the
> >> number of interactions per item. If you have some topsellers in your
> >> data, those might make up most of the large result. It helps very
> >> much to throw out the few most popular items (if your application
> >> allows that).
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 21.05.2013 12:10, Han JU wrote:
> >>> Hi Sebastian,
> >>>
> >>> It's something like frequent item pairs out of transaction data.
> >>> I need all these pairs with a rather low support (say 2), so the
> >>> result could be very big.
> >>>
> >>> 2013/5/21 Sebastian Schelter
> >>>
> >>>> Hello Han,
> >>>>
> >>>> out of curiosity, what do you compute that has such a big result?
> >>>>
> >>>> Best,
> >>>> Sebastian
> >>>>
> >>>> On 21.05.2013 11:52, Han JU wrote:
> >>>>> Hi Maja,
> >>>>>
> >>>>> The input graph of my problem is not big; the calculation result
> >>>>> is very big.
> >>>>> In fact, what does out-of-core graph mean? Where can I find some
> >>>>> examples of this, and of output during computation?
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>> 2013/5/17 Maja Kabiljo
> >>>>>
> >>>>>> Hi JU,
> >>>>>>
> >>>>>> One thing you can try is to use the out-of-core graph
> >>>>>> (giraph.useOutOfCoreGraph option).
> >>>>>>
> >>>>>> I don't know what your exact use case is – do you have a graph
> >>>>>> which is huge, or is it the data which you calculate in your
> >>>>>> application? In the second case, there is the
> >>>>>> 'giraph.doOutputDuringComputation' option you might want to try
> >>>>>> out. When that is turned on, during each superstep writeVertex
> >>>>>> will be called immediately after compute is called for that
> >>>>>> vertex. This means that you can store the data you want to write
> >>>>>> in the vertex, write it, and clear the data before going to the
> >>>>>> next vertex.
> >>>>>>
> >>>>>> Maja
> >>>>>>
> >>>>>> From: Han JU
> >>>>>> Reply-To: "user@giraph.apache.org"
> >>>>>> Date: Friday, May 17, 2013 8:38 AM
> >>>>>> To: "user@giraph.apache.org"
> >>>>>> Subject: What if the resulting graph is larger than the memory?
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> It's me again.
> >>>>>> After a day's work I've coded a Giraph solution for my problem
> >>>>>> at hand. I gave it a run on a medium dataset and it's notably
> >>>>>> faster than other approaches.
> >>>>>>
> >>>>>> However, the goal is to process larger inputs. For example, I
> >>>>>> have a larger dataset whose result graph is about 400GB when
> >>>>>> represented in edge format in a text file. And I think the edges
> >>>>>> that the algorithm creates all reside in the cluster's memory.
> >>>>>> So does it mean that for this big dataset I need a cluster with
> >>>>>> ~400GB of main memory? Is there any possibility that I can
> >>>>>> output "on the go", meaning I don't need to construct the whole
> >>>>>> graph: an edge is output to HDFS immediately instead of being
> >>>>>> created in main memory and then output?
> >>>>>>
> >>>>>> Thanks!
> >>>>>> --
> >>>>>> *JU Han*
> >>>>>>
> >>>>>> Software Engineer Intern @ KXEN Inc.
> >>>>>> UTC - Université de Technologie de Compiègne
> >>>>>> *GI06 - Fouille de Données et Décisionnel*
> >>>>>>
> >>>>>> +33 0619608888

--
Claudio Martella
claudio.martella@gmail.com
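[Editor's note: the options discussed above are passed to a Giraph job as custom arguments. A minimal launch sketch follows; the jar name, computation class, input/output formats, and paths are placeholders, and only the two `-ca` options come from the thread. Note the thread mentions two spellings (Claudio's giraph.useOutOfCore, Maja's giraph.useOutOfCoreGraph); the exact property name depends on your Giraph version, so check the options reference for the release you run.]

```shell
# Hypothetical launch of a Giraph job with out-of-core graph enabled.
# Jar, class, formats, and HDFS paths below are placeholders.
hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
    com.example.MyComputation \
    -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
    -vip /user/me/input \
    -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /user/me/output \
    -w 4 \
    -ca giraph.useOutOfCore=true \
    -ca giraph.maxPartitionsInMemory=10
```

With these two arguments, each worker holds at most 10 partitions on the heap and spills the rest to local disk, as described above.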
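[Editor's note: Maja's suggestion for the case where the *result* (not the input graph) is too large is likewise a custom argument. A sketch of the relevant fragment, to be added to a job launch command; the flag name is as quoted in the thread, the semantics in the comments are her description.]

```shell
# Write vertices out during each superstep instead of holding the whole
# result until the end of the job: writeVertex is called immediately
# after compute for each vertex, so the vertex can buffer its output
# data, have it written, and clear it before the next vertex is processed.
-ca giraph.doOutputDuringComputation=true
```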