Return-Path: X-Original-To: apmail-flink-issues-archive@minotaur.apache.org Delivered-To: apmail-flink-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 5207D17CAD for ; Mon, 27 Jul 2015 15:43:14 +0000 (UTC) Received: (qmail 99594 invoked by uid 500); 27 Jul 2015 15:43:04 -0000 Delivered-To: apmail-flink-issues-archive@flink.apache.org Received: (qmail 99561 invoked by uid 500); 27 Jul 2015 15:43:04 -0000 Mailing-List: contact issues-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@flink.apache.org Delivered-To: mailing list issues@flink.apache.org Received: (qmail 99403 invoked by uid 99); 27 Jul 2015 15:43:04 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 27 Jul 2015 15:43:04 +0000 Date: Mon, 27 Jul 2015 15:43:04 +0000 (UTC) From: "Martin Junghanns (JIRA)" To: issues@flink.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (FLINK-2411) Add basic graph summarization algorithm MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/FLINK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Junghanns updated FLINK-2411: ------------------------------------ Description: Graph summarization determines a structural grouping of similar vertices and edges to condense a graph and thus helps to uncover insights about patterns hidden in the graph. It can be used in OLAP-style operations on the graph and is similar to group by in SQL but on the graph structure instead of rows. The graph summarization operator represents every vertex group by a single vertex in the summarized graph; edges between vertices in the summary graph represent a group of edges between the vertex group members of the original graph. Summarization is defined by specifying grouping keys for vertices and edges, respectively. One publication that presents a Map/Reduce based approach is "Pagrol: Parallel graph olap over large-scale attributed graphs", however they pre-compute the graph-cube before it can be analyzed. With Flink, we can give the user an interactive way of summarizing the graph and do not need to compute the cube beforehand. A more complex approach focuses on summarization on graph patterns "SynopSys: Large Graph Analytics in the SAP HANA Database Through Summarization". However, I want to start with a simple algorithm that summarizes the graph on vertex and optionally edge values and additionally stores a count aggregate at summarized vertices/edges. Consider the following two examples (e.g., social network with users from cities and friendships with timestamp): h4. Input graph: Vertices (id, value): (0, Leipzig) (1, Leipzig) (2, Dresden) (3, Dresden) (4, Dresden) (5, Berlin) Edges (source, target, value): (0, 1, 2014) (1, 0, 2014) (1, 2, 2013) (2, 1, 2013) (2, 3, 2014) (3, 2, 2014) (4, 0, 2013) (4, 1, 2015) (5, 2, 2015) (5, 3, 2015) h4. Output graph (summarized on vertex value): Vertices (id, value, count) (0, Leipzig, 2) // "2 users from Leipzig" (2, Dresden, 3) // "3 users from Dresden" (5, Berlin, 1) // "1 user from Berlin" Edges (source, target, count) (0, 0, 2) // "2 edges between users in Leipzig" (0, 2, 1) // "1 edge from users in Leipzig to users in Dresden" (2, 0, 3) // "3 edges from users in Dresden to users in Leipzig" (2, 2, 2) // "2 edges between users in Dresden" (5, 2, 2) // "2 edges from users in Berlin to users in Dresden" h4. Output graph (summarized on vertex and edge value): Vertices (id, value, count) (0, Leipzig, 2) (2, Dresden, 3) (5, Berlin, 1) Edges (source, target, value, count) (0, 0, 2014, 2) // ... (0, 2, 2013, 1) // ... (2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with timestamp 2013" (2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with timestamp 2015" (2, 2, 2014, 2) // ... (5, 2, 2015, 2) // ... I've already implemented two versions of the summarization algorithm in our own project https://github.com/dbs-leipzig/gradoop, which is a graph analytics stack on top of Hadoop + Gelly/Flink with a fixed data model. You can see the current WIP here: 1 [Abstract summarization|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java] 2 [Implementation using cross|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java] 3 [Implementation using joins|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java] 4 [Tests|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/test/java/org/gradoop/model]/impl/EPGraphSummarizeTest.java 5 [TestGraph|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/dev-support/social-network.pdf] I would basically use the same implementation as in 3 in combination with KeySelectors to select the grouping keys on vertices and edges. As you can see in the example, each vertex in the resulting graph has a vertex id that is contained in the original graph. This id is the smallest id among the grouped vertices (e.g., vertices 2, 3 and 4 represent Dresden, so 2 is the group representative). The latter is necessary to correctly assign the summarized edges. Maybe there is a smarter way to do it of which I did not think yet. I would like contribute this to Flink and of course, if you have any suggestions/improvements or do not want this at all (hopefully not), please let me know. was: Graph summarization determines a structural grouping of similar vertices and edges to condense a graph and thus helps to uncover insights about patterns hidden in the graph. It can be used in OLAP-style operations on the graph and is similar to group by in SQL but on the graph structure instead of rows. The graph summarization operator represents every vertex group by a single vertex in the summarized graph; edges between vertices in the summary graph represent a group of edges between the vertex group members of the original graph. Summarization is defined by specifying grouping keys for vertices and edges, respectively. One publication that presents a Map/Reduce based approach is "Pagrol: Parallel graph olap over large-scale attributed graphs", however they pre-compute the graph-cube before it can be analyzed. With Flink, we can give the user an interactive way of summarizing the graph and do not need to compute the cube beforehand. A more complex approach focuses on summarization on graph patterns "SynopSys: Large Graph Analytics in the SAP HANA Database Through Summarization". However, I want to start with a simple algorithm that summarizes the graph on vertex and optionally edge values and additionally stores a count aggregate at summarized vertices/edges. Consider the following two examples (e.g., social network with users from cities and friendships with timestamp): h4. Input graph: Vertices (id, value): (0, Leipzig) (1, Leipzig) (2, Dresden) (3, Dresden) (4, Dresden) (5, Berlin) Edges (source, target, value): (0, 1, 2014) (1, 0, 2014) (1, 2, 2013) (2, 1, 2013) (2, 3, 2014) (3, 2, 2014) (4, 0, 2013) (4, 1, 2015) (5, 2, 2015) (5, 3, 2015) h4. Output graph (summarized on vertex value): Vertices (id, value, count) (0, Leipzig, 2) // "2 users from Leipzig" (2, Dresden, 3) // "3 users from Dresden" (5, Berlin, 1) // "1 user from Berlin" Edges (source, target, count) (0, 0, 2) // "2 edges between users in Leipzig" (0, 2, 1) // "1 edge from users in Leipzig to users in Dresden" (2, 0, 3) // "3 edges from users in Dresden to users in Leipzig" (2, 2, 2) // "2 edges between users in Dresden" (5, 2, 2) // "2 edges from users in Berlin to users in Dresden" h4. Output graph (summarized on vertex and edge value): Vertices (id, value, count) (0, Leipzig, 2) (2, Dresden, 3) (5, Berlin, 1) Edges (source, target, value, count) (0, 0, 2014, 2) // ... (0, 2, 2013, 1) // ... (2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with timestamp 2013" (2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with timestamp 2015" (2, 2, 2014, 2) // ... (5, 2, 2015, 2) // ... I've already implemented two versions of the summarization algorithm in our own project https://github.com/dbs-leipzig/gradoop, which is a graph analytics stack on top of Hadoop + Gelly/Flink with a fixed data model. You can see the current WIP here: 1 [Abstract summarization|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java] 2 [Implementation using cross|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java] 3 [Implementation using joins|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java] 4 [Tests|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/test/java/org/gradoop/model]/impl/EPGraphSummarizeTest.java 5 [TestGraph|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/dev-support/social-network.pdf] I would basically use the same implementation as in 3 in combination with KeySelectors to select the grouping keys on vertices and edges. As you can see in the example, each vertex in the resulting graph has a vertex id that is contained in the original graph. This id is the smallest id among the grouped vertices (e.g., vertices 2, 3 and 4 represent Dresden, so 2 is the group representative). The latter is necessary to correctly assign the summarized edges. Maybe there is a smarter way to do it of which I did not think yet. I would like contribute this to Flink and of course, if you have any suggestions/improvements or do not want this at all (hopefully not), please let me know. > Add basic graph summarization algorithm > --------------------------------------- > > Key: FLINK-2411 > URL: https://issues.apache.org/jira/browse/FLINK-2411 > Project: Flink > Issue Type: New Feature > Components: Gelly > Affects Versions: master > Reporter: Martin Junghanns > Priority: Minor > > Graph summarization determines a structural grouping of similar vertices and edges to condense a graph and thus helps to uncover insights about patterns hidden in the graph. It can be used in OLAP-style operations on the graph and is similar to group by in SQL but on the graph structure instead of rows. > > The graph summarization operator represents every vertex group by a single vertex in the summarized graph; edges between vertices in the summary graph represent a group of edges between the vertex group members of the original graph. Summarization is defined by specifying grouping keys for vertices and edges, respectively. > One publication that presents a Map/Reduce based approach is "Pagrol: Parallel graph olap over large-scale attributed graphs", however they pre-compute the graph-cube before it can be analyzed. With Flink, we can give the user an interactive way of summarizing the graph and do not need to compute the cube beforehand. > A more complex approach focuses on summarization on graph patterns "SynopSys: Large Graph Analytics in the SAP HANA Database Through Summarization". > However, I want to start with a simple algorithm that summarizes the graph on vertex and optionally edge values and additionally stores a count aggregate at summarized vertices/edges. > Consider the following two examples (e.g., social network with users from cities and friendships with timestamp): > > h4. Input graph: > > Vertices (id, value): > (0, Leipzig) > (1, Leipzig) > (2, Dresden) > (3, Dresden) > (4, Dresden) > (5, Berlin) > Edges (source, target, value): > (0, 1, 2014) > (1, 0, 2014) > (1, 2, 2013) > (2, 1, 2013) > (2, 3, 2014) > (3, 2, 2014) > (4, 0, 2013) > (4, 1, 2015) > (5, 2, 2015) > (5, 3, 2015) > h4. Output graph (summarized on vertex value): > Vertices (id, value, count) > (0, Leipzig, 2) // "2 users from Leipzig" > (2, Dresden, 3) // "3 users from Dresden" > (5, Berlin, 1) // "1 user from Berlin" > Edges (source, target, count) > (0, 0, 2) // "2 edges between users in Leipzig" > (0, 2, 1) // "1 edge from users in Leipzig to users in Dresden" > (2, 0, 3) // "3 edges from users in Dresden to users in Leipzig" > (2, 2, 2) // "2 edges between users in Dresden" > (5, 2, 2) // "2 edges from users in Berlin to users in Dresden" > h4. Output graph (summarized on vertex and edge value): > Vertices (id, value, count) > (0, Leipzig, 2) > (2, Dresden, 3) > (5, Berlin, 1) > Edges (source, target, value, count) > (0, 0, 2014, 2) // ... > (0, 2, 2013, 1) // ... > (2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with timestamp 2013" > (2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with timestamp 2015" > (2, 2, 2014, 2) // ... > (5, 2, 2015, 2) // ... > I've already implemented two versions of the summarization algorithm in our own project https://github.com/dbs-leipzig/gradoop, > which is a graph analytics stack on top of Hadoop + Gelly/Flink with a fixed data model. You can see the current WIP here: > 1 [Abstract summarization|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java] > 2 [Implementation using cross|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java] > 3 [Implementation using joins|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java] > 4 [Tests|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/test/java/org/gradoop/model]/impl/EPGraphSummarizeTest.java > 5 [TestGraph|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/dev-support/social-network.pdf] > I would basically use the same implementation as in 3 in combination with KeySelectors to select the grouping keys on vertices and edges. > As you can see in the example, each vertex in the resulting graph has a vertex id that is contained in the original graph. This id is the smallest id among the grouped vertices (e.g., vertices 2, 3 and 4 represent Dresden, so 2 is the group representative). The latter is necessary to correctly assign the summarized edges. Maybe there is a smarter way > to do it of which I did not think yet. > I would like contribute this to Flink and of course, if you have any suggestions/improvements or do not want this at all (hopefully not), please let me know. -- This message was sent by Atlassian JIRA (v6.3.4#6332)