[ https://issues.apache.org/jira/browse/FLINK2411?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Vasia Kalavri resolved FLINK2411.

Resolution: Implemented
Fix Version/s: 1.0
> Add basic graph summarization algorithm
> 
>
> Key: FLINK2411
> URL: https://issues.apache.org/jira/browse/FLINK2411
> Project: Flink
> Issue Type: New Feature
> Components: Gelly
> Affects Versions: 0.10
> Reporter: Martin Junghanns
> Assignee: Martin Junghanns
> Priority: Minor
> Fix For: 1.0
>
>
> Graph summarization determines a structural grouping of similar vertices and edges to
condense a graph and thus helps to uncover insights about patterns hidden in the graph. It
can be used in OLAPstyle operations on the graph and is similar to group by in SQL but on
the graph structure instead of rows.
>
> The graph summarization operator represents every vertex group by a single vertex in
the summarized graph; edges between vertices in the summary graph represent a group of edges
between the vertex group members of the original graph. Summarization is defined by specifying
grouping keys for vertices and edges, respectively.
> One publication that presents a Map/Reduce based approach is "Pagrol: Parallel graph
olap over largescale attributed graphs", however they precompute the graphcube before it
can be analyzed. With Flink, we can give the user an interactive way of summarizing the graph
and do not need to compute the cube beforehand.
> A more complex approach focuses on summarization on graph patterns "SynopSys: Large
Graph Analytics in the SAP HANA Database Through Summarization".
> However, I want to start with a simple algorithm that summarizes the graph on vertex
and optionally edge values and additionally stores a count aggregate at summarized vertices/edges.
> Consider the following two examples (e.g., social network with users from cities and
friendships with timestamp):
>
> h4. Input graph:
>
> Vertices (id, value):
> (0, Leipzig)
> (1, Leipzig)
> (2, Dresden)
> (3, Dresden)
> (4, Dresden)
> (5, Berlin)
> Edges (source, target, value):
> (0, 1, 2014)
> (1, 0, 2014)
> (1, 2, 2013)
> (2, 1, 2013)
> (2, 3, 2014)
> (3, 2, 2014)
> (4, 0, 2013)
> (4, 1, 2015)
> (5, 2, 2015)
> (5, 3, 2015)
> h4. Output graph (summarized on vertex value):
> Vertices (id, value, count)
> (0, Leipzig, 2) // "2 users from Leipzig"
> (2, Dresden, 3) // "3 users from Dresden"
> (5, Berlin, 1) // "1 user from Berlin"
> Edges (source, target, count)
> (0, 0, 2) // "2 edges between users in Leipzig"
> (0, 2, 1) // "1 edge from users in Leipzig to users in Dresden"
> (2, 0, 3) // "3 edges from users in Dresden to users in Leipzig"
> (2, 2, 2) // "2 edges between users in Dresden"
> (5, 2, 2) // "2 edges from users in Berlin to users in Dresden"
> h4. Output graph (summarized on vertex and edge value):
> Vertices (id, value, count)
> (0, Leipzig, 2)
> (2, Dresden, 3)
> (5, Berlin, 1)
> Edges (source, target, value, count)
> (0, 0, 2014, 2) // ...
> (0, 2, 2013, 1) // ...
> (2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with timestamp
2013"
> (2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with timestamp 2015"
> (2, 2, 2014, 2) // ...
> (5, 2, 2015, 2) // ...
> I've already implemented two versions of the summarization algorithm in our own project
[Gradoophttps://github.com/dbsleipzig/gradoop], which is a graph analytics stack on top
of Hadoop + Gelly/Flink with a fixed data model. You can see the current WIP here:
> 1 [Abstract summarizationhttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/Summarization.java]
> 2 [Implementation using crosshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java]
> 3 [Implementation using joinshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java]
> 4 [Testshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/test/java/org/gradoop/model/impl/EPGraphSummarizeTest.java]
> 5 [TestGraphhttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/devsupport/socialnetwork.pdf]
> I would basically use the same implementation as in 3 in combination with KeySelectors
to select the grouping keys on vertices and edges.
> As you can see in the example, each vertex in the resulting graph has a vertex id that
is contained in the original graph. This id is the smallest id among the grouped vertices
(e.g., vertices 2, 3 and 4 represent Dresden, so 2 is the group representative). The latter
is necessary to correctly assign the summarized edges. Maybe there is a smarter way to do
it of which I did not think of yet.
> I would like to contribute this to Flink and of course, if you have any suggestions/improvements
or do not want this at all (hopefully not), please let me know.

This message was sent by Atlassian JIRA
(v6.3.4#6332)
