Return-Path: X-Original-To: apmail-flink-dev-archive@www.apache.org Delivered-To: apmail-flink-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D12F710EEF for ; Thu, 19 Feb 2015 20:04:41 +0000 (UTC) Received: (qmail 5230 invoked by uid 500); 19 Feb 2015 20:04:41 -0000 Delivered-To: apmail-flink-dev-archive@flink.apache.org Received: (qmail 5173 invoked by uid 500); 19 Feb 2015 20:04:41 -0000 Mailing-List: contact dev-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@flink.apache.org Delivered-To: mailing list dev@flink.apache.org Received: (qmail 5097 invoked by uid 99); 19 Feb 2015 20:04:41 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 19 Feb 2015 20:04:41 +0000 X-ASF-Spam-Status: No, hits=2.8 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,URI_HEX X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of vasilikikalavri@gmail.com designates 209.85.217.177 as permitted sender) Received: from [209.85.217.177] (HELO mail-lb0-f177.google.com) (209.85.217.177) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 19 Feb 2015 20:04:36 +0000 Received: by lbvn10 with SMTP id n10so2105373lbv.6 for ; Thu, 19 Feb 2015 12:03:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=NlzWPPteZ583CMPKHs6di3orynezy6Ap98lWAX6WP8I=; b=h+Fb4niN53tk9459x/hSVnqk4IOvbYpXZAbASZimblXegYFrFL6OO/wTbJ//8x85KQ qISxI3eWF/lx2OelZqWrCLSZ+EmwCeRMC/YcP8H0FJN1yLz8BRAOdxs4+Cd+H8EG2i4R fX3IpXW7p+W5Tl0l0Qyl0CvTHW1nqK2+BJMRomvzU+CGvKmYxxaPub077XuarD1nIjuM t0NZ9ttvpXlUS1ZDRPNKqkmWAn+iyGnC4zGR0BXyCiIYL7jVb55tX1vl/Wn2v4//qryt oJbWdBaLhzt+5H7xKtpjsJodE2kET9hVKj2L3d0yKKacaC64EG+XliR3GCkcahPoPfMU Rdwg== MIME-Version: 1.0 X-Received: by 10.112.13.38 with SMTP id e6mr5523313lbc.31.1424376210490; Thu, 19 Feb 2015 12:03:30 -0800 (PST) Received: by 10.114.172.164 with HTTP; Thu, 19 Feb 2015 12:03:30 -0800 (PST) Date: Thu, 19 Feb 2015 21:03:30 +0100 Message-ID: Subject: [DISCUSS] Gelly iteration abstractions From: Vasiliki Kalavri To: dev@flink.apache.org Content-Type: multipart/alternative; boundary=001a11c3f2140d8ac1050f766e78 X-Virus-Checked: Checked by ClamAV on apache.org --001a11c3f2140d8ac1050f766e78 Content-Type: text/plain; charset=UTF-8 Hello beautiful Flink people, during the past few days, Andra and I have been discussing about how to extend Gelly's iteration methods. Alexander's course (and his awesome students) has made it obvious that vertex-centric iterations are not the best fit for algorithms which don't follow the common "propagate-update" pattern. For example, Andra is working on an implementation of Minimum Spanning Tree, which requires branching inside an iteration and also requires a convergence check of an internal iteration. Others also reported similar issues [1, 2]. Trying to fit such algorithms to the vertex-centric model leads to long and ugly code, e.g. aggregators to keep track of algorithm phases, duplicating data, etc. One limitation of the vertex-centric and the upcoming GAS model is that they both only allow the vertex values to be updated in each iteration. However, for some algorithms we need to update the edge values and in others we need to update both. In even more complex situations (like Andra's MST) in some iterations we need to update the vertex values and in some iterations we need to update the edge values. Another problem is that we currently don't have a way to allow different computational phases inside an iteration. This is something that Giraph solves with master compute, a function that is executed once before each superstep and sets the computation function. All that said, I believe that we can solve most of these issues if we nicely expose Flink's iteration operators in Gelly. I can see the following cases: 1. Bulk & delta iterations where the solution set is the vertex dataset: this will be similar to vertex-centric and GAS, but will allow more flexible dataflows inside the iteration. 2. Bulk & delta iterations where the solution set is the edge dataset: for the cases where we need to update edge values. 3. Bulk & delta iterations where the solution set is the Graph: this will cover more complex cases, where the algorithm updates both vertices and edges or even adds/removes vertices/edges, i.e. updates the whole Graph. What do you think? I can see 1 & 2 being very easy to implement, but I suspect 3 won't be that easy (but so awesome to have ^^). Would it work the way a Graph is represented now, i.e. with 2 DataSets? Any comment, idea, pointer would be much appreciated! Thank you ^^ Cheers, -V. [1]: http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html [2]: http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769 --001a11c3f2140d8ac1050f766e78--