giraph-user mailing list archives

From Claudio Martella <claudio.marte...@gmail.com>
Subject Re: What if the resulting graph is larger than the memory?
Date Tue, 21 May 2013 12:33:48 GMT
Let me understand. You said that your graph is about 400GB, and that you
don't have 400GB of main memory in your cluster (setting aside for now that
400GB of edge-based input would not actually translate into that amount of
memory used on the heap, which is not the case). If THIS is your problem,
and not, for example, that the MESSAGES you create would exceed your
available memory, or that your graph is going to grow even more during the
computation (because you use the mutation API), then the out-of-core graph
should be the way to go for you. It is very simple: you specify how many
partitions a worker keeps in memory, and Giraph will keep only that number
of partitions in memory. The rest are stored on disk and loaded into memory
only when they are computed (causing one of the partitions currently in
memory to be spilled to disk).
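To make the policy concrete, here is a toy sketch in plain Java of the idea
described above: keep at most N partitions in memory and spill the
least-recently-used one when another partition has to be loaded. This is NOT
Giraph's actual implementation, just an illustration; "spilling" here only
prints a message where a real system would serialize the partition to disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class PartitionCache {
    private final int maxInMemory;
    private final LinkedHashMap<Integer, String> inMemory;

    public PartitionCache(int maxInMemory) {
        this.maxInMemory = maxInMemory;
        // An access-order LinkedHashMap gives us LRU eviction for free:
        // removeEldestEntry is consulted after every insertion.
        this.inMemory = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                if (size() > PartitionCache.this.maxInMemory) {
                    System.out.println("spilling partition " + eldest.getKey());
                    // a real system would write the partition to local disk here
                    return true;
                }
                return false;
            }
        };
    }

    /** Bring a partition into memory, "loading" it if it is not there. */
    public void access(int id) {
        inMemory.computeIfAbsent(id, k -> "partition-" + k);
    }

    public Set<Integer> inMemoryIds() {
        return inMemory.keySet();
    }

    public static void main(String[] args) {
        PartitionCache cache = new PartitionCache(2);
        cache.access(1);
        cache.access(2);
        cache.access(3);  // exceeds the limit: partition 1 is spilled
        System.out.println("in memory: " + cache.inMemoryIds());
    }
}
```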

The options are giraph.useOutOfCoreGraph=true and
giraph.maxPartitionsInMemory (default 10) to control the number of
partitions.
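For example, a Giraph job launched through GiraphRunner can pass these as
custom arguments with -ca. The computation class, input/output formats, and
HDFS paths below are placeholders for your own job:

```shell
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
  my.app.MyComputation \
  -vip /user/me/input \
  -op /user/me/output \
  -w 4 \
  -ca giraph.useOutOfCoreGraph=true \
  -ca giraph.maxPartitionsInMemory=5
```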


On Tue, May 21, 2013 at 2:18 PM, Sebastian Schelter <ssc.open@googlemail.com
> wrote:

> It simply means that not all partitions of the graph are in memory all
> the time. If you don't have enough memory, some of them might get spilled
> to disk.
>
> On 21.05.2013 14:16, Han JU wrote:
> > Thanks, that's a good point.
> > But for the moment I just want to try out different solutions on
> > Hadoop and have a comparison of them. So I'd like to see how they
> > perform under general conditions.
> >
> > Do you happen to know what out-of-core graph means?
> >
> > Thanks.
> >
> >
> > 2013/5/21 Sebastian Schelter <ssc.open@googlemail.com>
> >
> >> Ah, I see. I have worked on similar things in recommender systems. Here
> >> the problem is generally that you get a result quadratic in the number
> >> of interactions per item. If you have some top sellers in your data,
> >> those may account for most of the large result. It helps very much to
> >> throw out the few most popular items (if your application allows that).
> >>
> >> Best,
> >> Sebastian
> >>
> >>
> >> On 21.05.2013 12:10, Han JU wrote:
> >>> Hi Sebastian,
> >>>
> >>> It's something like frequent item pairs out of transaction data.
> >>> I need all these pairs with a rather low support (say 2), so the
> >>> result could be very big.
> >>>
> >>>
> >>>
> >>> 2013/5/21 Sebastian Schelter <ssc.open@googlemail.com>
> >>>
> >>>> Hello Han,
> >>>>
> >>>> out of curiosity, what do you compute that has such a big result?
> >>>>
> >>>> Best,
> >>>> Sebastian
> >>>>
> >>>> On 21.05.2013 11:52, Han JU wrote:
> >>>>> Hi Maja,
> >>>>>
> >>>>> The input graph of my problem is not big, but the calculation
> >>>>> result is very big.
> >>>>> In fact, what does out-of-core graph mean? Where can I find some
> >>>>> examples of this, and of output during computation?
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>
> >>>>>
> >>>>> 2013/5/17 Maja Kabiljo <majakabiljo@fb.com>
> >>>>>
> >>>>>>  Hi JU,
> >>>>>>
> >>>>>>  One thing you can try is to use out-of-core graph
> >>>>>> (giraph.useOutOfCoreGraph option).
> >>>>>>
> >>>>>>  I don't know what your exact use case is – is it the graph
> >>>>>> itself that is huge, or the data which you calculate in your
> >>>>>> application? In the second case, there is the
> >>>>>> 'giraph.doOutputDuringComputation' option you might want to try
> >>>>>> out. When it is turned on, during each superstep writeVertex
> >>>>>> will be called for a vertex immediately after compute is called
> >>>>>> for it. This means that you can store the data you want to write
> >>>>>> in the vertex, write it, and clear it before going on to the
> >>>>>> next vertex.
> >>>>>>
> >>>>>>  Maja
> >>>>>>
> >>>>>>   From: Han JU <ju.han.felix@gmail.com>
> >>>>>> Reply-To: "user@giraph.apache.org" <user@giraph.apache.org>
> >>>>>> Date: Friday, May 17, 2013 8:38 AM
> >>>>>> To: "user@giraph.apache.org" <user@giraph.apache.org>
> >>>>>> Subject: What if the resulting graph is larger than the memory?
> >>>>>>
> >>>>>>   Hi,
> >>>>>>
> >>>>>>  It's me again.
> >>>>>> After a day's work I've coded a Giraph solution for my problem
> >>>>>> at hand. I gave it a run on a medium dataset and it's notably
> >>>>>> faster than other approaches.
> >>>>>>
> >>>>>>  However, the goal is to process larger inputs. For example, I
> >>>>>> have a larger dataset whose result graph is about 400GB when
> >>>>>> represented in edge format as a text file. And I think the edges
> >>>>>> that the algorithm creates all reside in the cluster's memory.
> >>>>>> So does that mean that for this big dataset I need a cluster
> >>>>>> with ~400GB of main memory? Is there any possibility to output
> >>>>>> "on the go", meaning I don't need to construct the whole graph:
> >>>>>> an edge is output to HDFS immediately instead of being created
> >>>>>> in main memory and then output?
> >>>>>>
> >>>>>>  Thanks!
> >>>>>> --
> >>>>>> *JU Han*
> >>>>>>
> >>>>>> Software Engineer Intern @ KXEN Inc.
> >>>>>> UTC - Université de Technologie de Compiègne
> >>>>>> GI06 - Fouille de Données et Décisionnel
> >>>>>>
> >>>>>>  +33 0619608888
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
>
>


-- 
   Claudio Martella
   claudio.martella@gmail.com
