Return-Path: X-Original-To: apmail-incubator-giraph-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-giraph-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id C2DE7B811 for ; Tue, 17 Jan 2012 01:24:27 +0000 (UTC) Received: (qmail 98841 invoked by uid 500); 17 Jan 2012 01:24:27 -0000 Delivered-To: apmail-incubator-giraph-dev-archive@incubator.apache.org Received: (qmail 98769 invoked by uid 500); 17 Jan 2012 01:24:27 -0000 Mailing-List: contact giraph-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: giraph-dev@incubator.apache.org Delivered-To: mailing list giraph-dev@incubator.apache.org Received: (qmail 98761 invoked by uid 99); 17 Jan 2012 01:24:26 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 17 Jan 2012 01:24:26 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of claudio.martella@gmail.com designates 209.85.210.175 as permitted sender) Received: from [209.85.210.175] (HELO mail-iy0-f175.google.com) (209.85.210.175) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 17 Jan 2012 01:24:20 +0000 Received: by iabz21 with SMTP id z21so2221305iab.6 for ; Mon, 16 Jan 2012 17:24:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type:content-transfer-encoding; bh=nUlkPXqBOXcqXTjaiIdCqZns7U0aRJvv8xke8EVOy/c=; b=UTqYL73c9WWqnNRf/jD9whUqIgGFd4B24n0prmqsGYBXzdjrO5WrpJ4jYZS+MTnLmX 46//DFMmhmXWH83vzKSCW9M4Lsl2y3nH0GhCrw47R7jtTokIrgSrdzqg2I2ASgf6JZw5 3ICk4+G4LQj3v7ncELtbjMEp4U/Kk5T66KjX8= Received: by 10.50.10.225 with SMTP id l1mr15219982igb.9.1326763440150; Mon, 16 Jan 2012 17:24:00 -0800 (PST) MIME-Version: 1.0 Received: by 10.50.242.66 with HTTP; Mon, 16 Jan 2012 17:23:39 -0800 (PST) In-Reply-To: <4F10806B.6030403@apache.org> References: <4F0FDEA5.5000000@apache.org> <4F107B88.8010402@apache.org> <4F10806B.6030403@apache.org> From: Claudio Martella Date: Tue, 17 Jan 2012 02:23:39 +0100 Message-ID: Subject: Re: why we should remove implicit vertex creation To: giraph-dev@incubator.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hi Avery, I inline the answers to your concerns. On Fri, Jan 13, 2012 at 8:05 PM, Avery Ching wrote: > Inline responses. > > Happy Friday, > > Avery > > > On 1/13/12 10:51 AM, Claudio Martella wrote: >> >> Hi Avery, >> >> thanks for your feedback. >> >> I'm advocating for allowing mutations only through Mutable interface >> methods. I agree that it can come handy to have the implicit vertex >> creation, for the reason you mentioned (which I called lazy inputset >> creation), but you can obtain the same through a simple =A0single M/R >> job run in advance. > > I think this is pretty expensive (extra MR job). =A0Users do have this op= tion, > but I doubt many would take it when they don't have to. Agreed, but if a graph is used multiple times, so if it's the input to multiple giraph jobs, the cost can be amortized quite a lot. You spend once to "complete the graph", you take advantage multiple times. Though I agree, the line here is quite fuzzy and I wouldn't haven considered it if it wasn't for the next point. > > >> What we win back is that we don't have the >> computational cost, and code complexity of checking if the vertex >> exists already for each message we get. > > Checking if the vertex exists is pretty cheap in a hashmap (constant time= ). > =A0We should verify that this is a computational overhead (maybe some > profiling) before optimizing it. =A0I suppose we could add a switch to by= pass > any graph mutation in general. True, but you've taken the best-case scenario. If you use range-partitioning then you hit a Tree. If we use another type of partitioning it might be even worse (think if the partitioning was held in a centralized store, maybe hbase, for topology partitioning). This for the vertex2partition function, which is not the expensive part, I guess. More than that I'm concerned about the exists() call to the vertex set as I'm thinking about sorting the vertex set, to allow for better iteration over out-of-core messages to completely avoid seeks. That would make the exists() quite expensive. I guess we can re-open the discussion once that code is ready. About that, did you have any chance to have a look at my email on the thread-safety of org.apache.giraph.graph.partition? Best, Claudio > >> You know what I mean? >> >> On Fri, Jan 13, 2012 at 7:44 PM, Avery Ching =A0wrote= : >>> >>> Claudio, >>> >>> What are you advocating in particular? >>> >>> Graph mutation should be allowed (i.e. adding vertices). =A0We allow th= is >>> to >>> happen through the addVertexReq() interface and through the >>> VertexResolver >>> implementation (say for messages to non-existent vertices). =A0I can se= e >>> why >>> this would be useful. =A0Imagine you are computing page rank on the web >>> graph, >>> but you only have a subset of the sites, but all the outlinks for each >>> site. >>> =A0It is nice to be able to allow new vertices (sites) while running th= e >>> application. >>> >>> I agree that the way that vertices are created and initialized is a bit >>> vague. =A0We can work on improving the interfaces if anyone has >>> suggestions. >>> >>> Avery >>> >>> >>> On 1/13/12 12:10 AM, Claudio Martella wrote: >>>> >>>> Hi Avery, >>>> >>>> thanks for your feedback. I know that users can decide to drop this >>>> behavior, but this doesn't mean that those three points don't hold, to >>>> me. >>>> >>>> On Fri, Jan 13, 2012 at 8:35 AM, Avery Ching >>>> =A0wrote: >>>>> >>>>> Claudio, >>>>> >>>>> You are right that vertices are created automatically when messages a= re >>>>> sent >>>>> to non-existent vertices. =A0But that behavior can be made applicatio= n >>>>> specific. =A0The default resolution of mutations/messages is >>>>> VertexResolver. >>>>> =A0But you are always welcome to implement your own application speci= fic >>>>> behavior. =A0For instance, you might just want to drop the message. = =A0If >>>>> there >>>>> is a simultaneous create/delete, you may want to always create. =A0Yo= u >>>>> have >>>>> the power to implement any behavior you want by setting the vertex >>>>> resolver >>>>> (see GiraphJob#setVertexResolverClass()). >>>>> >>>>> Hope this helps, >>>>> >>>>> Avery >>>>> >>>>> >>>>> On 1/12/12 3:42 PM, Claudio Martella wrote: >>>>>> >>>>>> Hello Giraphers, >>>>>> >>>>>> I have a few comments about the current design of Giraph regarding t= he >>>>>> implicit creation of vertices. >>>>>> As it's currently designed, if you send a message to a non-existent >>>>>> vertices, Giraph creates it for you. >>>>>> Although I can understand it can get handy as it allows for lazy >>>>>> dataset creation, I think it comes at some cost and I believe this >>>>>> cost is bigger than the advantage: >>>>>> >>>>>> 1) it overlaps the mutation API, where a vertex can be created >>>>>> explicitly when the semantics of the algorithm require it, with >>>>>> knowledge about what's going on and with explicit state. This is an >>>>>> ambiguous and unclear part of the API which is difficult for me to >>>>>> justify and probably confusing for the user too. Which brings me to >>>>>> the second point. >>>>>> >>>>>> 2) it requires a different, and partially duplicate,code path for >>>>>> mutations and implicit vertex creation in our code, as it's clear by >>>>>> looking at BasicRPCCommunication and as it's been experienced >>>>>> currently by me in the email I recently sent to the list. Which brin= gs >>>>>> me to the third point. >>>>>> >>>>>> 3) in order to manage this, for every message we have to hit, sooner >>>>>> or later, the Worker vertices set to see if the vertex is existing a= nd >>>>>> whether it should be implicitly created. This is computationally >>>>>> expensive both if you have a HashMap but also if you have a TreeMap >>>>>> for range partitioning. Also, if we're going to create more exotic >>>>>> partitioning (topology-partitioning?), we're going to hit the proble= m >>>>>> more. >>>>>> >>>>>> In general, I don't know any graph API that doesn't require to eithe= r >>>>>> list explicitly the vertex set at load or to create the vertex >>>>>> explicitly through API. As I said, I understand it allows for lazy >>>>>> creation of the input file, with possibly missing vertices explicitl= y >>>>>> enlisted (missing as a source vertex but existing as an endpoint for >>>>>> an edge), but this could be really fixed robustly by a single >>>>>> MapReduce job. >>>>>> >>>>>> What do you guys think? >>>>>> >>>> >> >> > --=20 =A0 =A0Claudio Martella =A0 =A0claudio.martella@gmail.com