Mailing-List: contact user-help@giraph.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@giraph.apache.org
Received-SPF: pass (nike.apache.org: domain of ssc.open@googlemail.com
 designates 209.85.214.51 as permitted sender)
Message-ID: <519B6625.4040407@googlemail.com>
Date: Tue, 21 May 2013 14:18:45 +0200
From: Sebastian Schelter <ssc.open@googlemail.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:17.0) Gecko/20130330 Thunderbird/17.0.5
MIME-Version: 1.0
To: user@giraph.apache.org
Subject: Re: What if the resulting graph is larger than the memory?
References: 
 <CA+ndhHqhs1vULbwpFEis7R23Abxi4Ngs700RckB7cJTfCeQriA@mail.gmail.com>
 <1F592C080E9ACB4CB1C9EA1865BF3EFA0D19D963@PRN-MBX01-2.TheFacebook.com>
 <CA+ndhHpw3Bfg8qe_AoSVHdJQ_kL8Gb6xfdQmj4WLExKL07yz4g@mail.gmail.com>
 <519B4517.1020802@googlemail.com>
 <CA+ndhHrXQhkKG3zjdZ8mz7+HJfPNeKrvdwMz3C0GBnLLkTtUzQ@mail.gmail.com>
 <519B494D.4080307@googlemail.com>
 <CA+ndhHqphMro5OTW1OK_aLH6zOmzq1=R9qvQgiX9MX90EyrNgQ@mail.gmail.com>
In-Reply-To: 
 <CA+ndhHqphMro5OTW1OK_aLH6zOmzq1=R9qvQgiX9MX90EyrNgQ@mail.gmail.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 8bit

It simply means that not all partitions of the graph are kept in memory
all the time. If you don't have enough memory, some of them may get
spilled to disk.
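
As a rough illustration of the idea only (this is not Giraph's actual
implementation; the class name, the method names and the in-memory map
standing in for the disk are all made up for this sketch), a partition
store that keeps a bounded number of partitions in memory and spills the
least recently used one could look like this:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy partition store (hypothetical): at most maxInMemory partitions stay in memory. */
class SpillingPartitionStore {
    private final int maxInMemory;
    // Access-ordered map: iteration starts at the least recently used partition.
    private final LinkedHashMap<Integer, List<Long>> inMemory =
        new LinkedHashMap<>(16, 0.75f, true);
    // Stand-in for local disk; a real store would serialize partitions to files.
    private final Map<Integer, List<Long>> onDisk = new HashMap<>();

    SpillingPartitionStore(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    /** Adds or replaces a partition, spilling older ones if the limit is exceeded. */
    void put(int partitionId, List<Long> vertexIds) {
        onDisk.remove(partitionId);
        inMemory.put(partitionId, vertexIds);
        spillIfNeeded();
    }

    /** Returns the partition, loading it back from "disk" if it was spilled. */
    List<Long> get(int partitionId) {
        List<Long> partition = inMemory.get(partitionId);
        if (partition == null) {
            partition = onDisk.remove(partitionId);
            if (partition == null) {
                return null; // unknown partition
            }
            inMemory.put(partitionId, partition);
            spillIfNeeded();
        }
        return partition;
    }

    /** Moves least recently used partitions to "disk" until the limit holds. */
    private void spillIfNeeded() {
        while (inMemory.size() > maxInMemory) {
            Integer eldest = inMemory.keySet().iterator().next();
            onDisk.put(eldest, inMemory.remove(eldest));
        }
    }

    boolean isInMemory(int partitionId) {
        return inMemory.containsKey(partitionId);
    }

    boolean isOnDisk(int partitionId) {
        return onDisk.containsKey(partitionId);
    }
}
```

With a limit of two partitions, adding a third one spills the oldest, and
reading the spilled one back pushes another partition out in turn. The
computation still sees every partition, it just pays for disk I/O when it
touches one that is not resident.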

On 21.05.2013 14:16, Han JU wrote:
> Thanks, that's a good point.
> But for the moment I just want to try out different solutions on hadoop and
> have a comparison of them. So I'd like to see how they perform under
> general conditions.
> 
> Do you happen to know what out-of-core graph means?
> 
> Thanks.
> 
> 
> 2013/5/21 Sebastian Schelter <ssc.open@googlemail.com>
> 
>> Ah, I see. I have worked on similar things in recommender systems. Here
>> the problem is generally that you get a result quadratic in the number
>> of interactions per item. If you have some topsellers in your data,
>> those might account for the large result. It helps very much to throw
>> out the few most popular items (if your application allows that).
>>
>> Best,
>> Sebastian
>>
>>
>> On 21.05.2013 12:10, Han JU wrote:
>>> Hi Sebastian,
>>>
>>> It's something like frequent item pairs out of transaction data.
>>> I need all these pairs with a rather low support (say 2), so the result
>>> could be very big.
>>>
>>>
>>>
>>> 2013/5/21 Sebastian Schelter <ssc.open@googlemail.com>
>>>
>>>> Hello Han,
>>>>
>>>> out of curiosity, what do you compute that has such a big result?
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 21.05.2013 11:52, Han JU wrote:
>>>>> Hi Maja,
>>>>>
>>>>> The input graph of my problem is not big, the calculation result is very
>>>>> big.
>>>>> In fact what does out-of-core graph mean? Where can I find some examples of
>>>>> this and for output during computation?
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>> 2013/5/17 Maja Kabiljo <majakabiljo@fb.com>
>>>>>
>>>>>>  Hi JU,
>>>>>>
>>>>>>  One thing you can try is to use out-of-core graph
>>>>>> (giraph.useOutOfCoreGraph option).
>>>>>>
>>>>>>  I don't know what your exact use case is - do you have the graph which
>>>>>> is huge or the data which you calculate in your application is? In the
>>>>>> second case, there is 'giraph.doOutputDuringComputation' option you might
>>>>>> want to try out. When that is turned on, during each superstep writeVertex
>>>>>> will be called immediately after compute for that vertex is called. This
>>>>>> means that you can store data you want to write in vertex, write it and
>>>>>> clear the data before going to the next vertex.
>>>>>>
>>>>>>  Maja
>>>>>>
>>>>>>   From: Han JU <ju.han.felix@gmail.com>
>>>>>> Reply-To: "user@giraph.apache.org" <user@giraph.apache.org>
>>>>>> Date: Friday, May 17, 2013 8:38 AM
>>>>>> To: "user@giraph.apache.org" <user@giraph.apache.org>
>>>>>> Subject: What if the resulting graph is larger than the memory?
>>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>>  It's me again.
>>>>>> After a day's work I've coded a Giraph solution for my problem at hand. I
>>>>>> gave it a run on a medium dataset and it's notably faster than other
>>>>>> approaches.
>>>>>>
>>>>>>  However the goal is to process larger inputs, for example I have a larger
>>>>>> dataset for which the result graph is about 400GB when represented in edge
>>>>>> format and in text file. And I think the edges that the algorithm created
>>>>>> all reside in the cluster's memory. So does it mean that for this big
>>>>>> dataset, I need a cluster with ~ 400GB main memory to run? Is there any
>>>>>> possibility that I can output "on the go", meaning that I don't need to
>>>>>> construct the whole graph, and an edge is output to HDFS immediately instead
>>>>>> of being created in main memory and then output?
>>>>>>
>>>>>>  Thanks!
>>>>>> --
>>>>>> *JU Han*
>>>>>>
>>>>>>    Software Engineer Intern @ KXEN Inc.
>>>>>>   UTC   -  Université de Technologie de Compiègne
>>>>>>    *     **GI06 - Fouille de Données et Décisionnel*
>>>>>>
>>>>>>  +33 0619608888
>>>>>>