Message-ID: <519B6625.4040407@googlemail.com>
Date: Tue, 21 May 2013 14:18:45 +0200
From: Sebastian Schelter
To: user@giraph.apache.org
Subject: Re: What if the resulting graph is larger than the memory?

It simply means that not all partitions of the graph are in memory all
the time. If you don't have enough memory, some of them might get
spilled to disk.

On 21.05.2013 14:16, Han JU wrote:
> Thanks, that's a good point.
> But for the moment I just want to try out different solutions on hadoop and
> have a comparison of them. So I'd like to see how they perform under
> general conditions.
>
> Do you happen to know what out-of-core graph means?
>
> Thanks.
>
>
> 2013/5/21 Sebastian Schelter
>
>> Ah, I see. I have worked on similar things in recommender systems. Here
>> the problem is generally that you get a result quadratic to the number
>> of interactions per item. If you have some topsellers in your data,
>> those might make up for the large result. It helps very much to throw
>> out the few most popular items (if your application allows that).
>>
>> Best,
>> Sebastian
>>
>>
>> On 21.05.2013 12:10, Han JU wrote:
>>> Hi Sebastian,
>>>
>>> It's something like frequent item pairs out of transaction data.
>>> I need all these pairs with somehow a low support (say 2), so the result
>>> could be very big.
>>>
>>>
>>> 2013/5/21 Sebastian Schelter
>>>
>>>> Hello Han,
>>>>
>>>> out of curiosity, what do you compute that has such a big result?
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 21.05.2013 11:52, Han JU wrote:
>>>>> Hi Maja,
>>>>>
>>>>> The input graph of my problem is not big, the calculation result is
>>>>> very big.
>>>>> In fact what does out-of-core graph mean? Where can I find some
>>>>> examples of this and for output during computation?
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> 2013/5/17 Maja Kabiljo
>>>>>
>>>>>> Hi JU,
>>>>>>
>>>>>> One thing you can try is to use out-of-core graph
>>>>>> (giraph.useOutOfCoreGraph option).
>>>>>>
>>>>>> I don't know what your exact use case is - do you have the graph which
>>>>>> is huge or the data which you calculate in your application is? In the
>>>>>> second case, there is 'giraph.doOutputDuringComputation' option you
>>>>>> might want to try out. When that is turned on, during each superstep
>>>>>> writeVertex will be called immediately after compute for that vertex
>>>>>> is called. This means that you can store data you want to write in
>>>>>> vertex, write it and clear the data before going to the next vertex.
>>>>>>
>>>>>> Maja
>>>>>>
>>>>>> From: Han JU
>>>>>> Reply-To: "user@giraph.apache.org"
>>>>>> Date: Friday, May 17, 2013 8:38 AM
>>>>>> To: "user@giraph.apache.org"
>>>>>> Subject: What if the resulting graph is larger than the memory?
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> It's me again.
>>>>>> After a day's work I've coded a Giraph solution for my problem at
>>>>>> hand. I gave it a run on a medium dataset and it's notably faster
>>>>>> than other approaches.
>>>>>>
>>>>>> However the goal is to process larger inputs, for example I've a
>>>>>> larger dataset that the result graph is about 400GB when represented
>>>>>> in edge format and in text file. And I think the edges that the
>>>>>> algorithm created all reside in the cluster's memory. So it means
>>>>>> that for this big dataset, I need a cluster with ~ 400GB main memory
>>>>>> to run? Is there any possibilities that I can output "on the go" that
>>>>>> means I don't need to construct the whole graph, an edge is outputed
>>>>>> to HDFS immediately instead of being created in main memory then be
>>>>>> outputed?
>>>>>>
>>>>>> Thanks!
>>>>>> --
>>>>>> *JU Han*
>>>>>>
>>>>>> Software Engineer Intern @ KXEN Inc.
>>>>>> UTC - Université de Technologie de Compiègne
>>>>>> *GI06 - Fouille de Données et Décisionnel*
>>>>>>
>>>>>> +33 0619608888
>>>>>>
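
Editor's sketch: the pattern Maja describes for 'giraph.doOutputDuringComputation' (write a vertex right after its compute call, then clear its data before moving to the next vertex) can be illustrated without Giraph at all. The Java below is only a plain simulation of that idea and does not use the Giraph API. The names OutputDuringComputationSketch, SketchVertex, runSuperstep and writeAndClear are hypothetical, invented for this illustration, and a PrintWriter stands in for the HDFS output.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutputDuringComputationSketch {

    /** Hypothetical stand-in for a vertex; this is NOT Giraph's Vertex class. */
    static final class SketchVertex {
        final String id;
        final List<String> targets;
        // Edges produced by compute() that have not been written out yet.
        final List<String> pendingEdges = new ArrayList<>();

        SketchVertex(String id, List<String> targets) {
            this.id = id;
            this.targets = targets;
        }

        /** Produces result edges and buffers them in the vertex. */
        void compute() {
            for (String target : targets) {
                pendingEdges.add(id + "\t" + target);
            }
        }

        /** Writes the buffered edges, then drops them to free memory. */
        void writeAndClear(PrintWriter out) {
            for (String edge : pendingEdges) {
                out.write(edge + "\n");
            }
            pendingEdges.clear();
        }
    }

    /**
     * Runs compute and write back to back for each vertex.
     * Returns the largest number of edges buffered at any one time.
     */
    static int runSuperstep(List<SketchVertex> vertices, PrintWriter out) {
        int peak = 0;
        for (SketchVertex vertex : vertices) {
            vertex.compute();
            peak = Math.max(peak, vertex.pendingEdges.size());
            vertex.writeAndClear(out);
        }
        out.flush();
        return peak;
    }

    public static void main(String[] args) {
        List<SketchVertex> vertices = Arrays.asList(
                new SketchVertex("a", Arrays.asList("b", "c")),
                new SketchVertex("b", Arrays.asList("c")));
        StringWriter sink = new StringWriter();
        int peak = runSuperstep(vertices, new PrintWriter(sink));
        System.out.print(sink);
        System.out.println("peak buffered edges: " + peak);
    }
}
```

In this toy run three edges are written in total, but no more than two are ever held in memory at once, because each vertex's buffer is emptied before the next vertex is processed. Whether the same bound holds in a real Giraph job depends on how the computation and the output format are written, which this sketch does not cover.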