Martin Junghanns created FLINK2411:

Summary: Add basic graph summarization algorithm
Key: FLINK2411
URL: https://issues.apache.org/jira/browse/FLINK2411
Project: Flink
Issue Type: New Feature
Components: Gelly
Affects Versions: master
Reporter: Martin Junghanns
Priority: Minor
Graph summarization determines a structural grouping of similar vertices and edges to condense
a graph and thus helps to uncover insights about patterns hidden in the graph. It can be used
in OLAPstyle operations on the graph and is similar to group by in SQL but on the graph structure
instead of rows.
The graph summarization operator represents every vertex group by a single vertex in the summarized
graph; edges between vertices in the summary graph represent a group of edges between the
vertex group members of the original graph. Summarization is defined by specifying grouping
keys for vertices and edges, respectively.
One publication that presents a Map/Reduce based approach is
"Pagrol: Parallel graph olap over largescale attributed graphs", however they precompute
the graphcube before it can be analyzed. With Flink, we can give the user an interactive
way of summarizing the graph and do not need to compute the cube beforehand.
A more complex approach focuses on summarization on graph patterns
"SynopSys: Large Graph Analytics in the SAP HANA Database Through Summarization".
However, I want to start with a simple algorithm that summarizes the graph on vertex and optionally
edge values and additionally stores a count aggregate at summarized vertices/edges.
Consider the following two examples
(e.g., social network with users from cities and friendships with timestamp):
h4. Input graph:
Vertices (id, value):
(0, Leipzig)
(1, Leipzig)
(2, Dresden)
(3, Dresden)
(4, Dresden)
(5, Berlin)
Edges (source, target, value):
(0, 1, 2014)
(1, 0, 2014)
(1, 2, 2013)
(2, 1, 2013)
(2, 3, 2014)
(3, 2, 2014)
(4, 0, 2013)
(4, 1, 2015)
(5, 2, 2015)
(5, 3, 2015)
h4. Output graph (summarized on vertex value):
Vertices (id, value, count)
(0, Leipzig, 2) // "2 users from Leipzig"
(2, Dresden, 3) // "3 users from Dresden"
(5, Berlin, 1) // "1 user from Berlin"
Edges (source, target, count)
(0, 0, 2) // "2 edges between users in Leipzig"
(0, 2, 1) // "1 edge from users in Leipzig to users in Dresden"
(2, 0, 3) // "3 edges from users in Dresden to users in Leipzig"
(2, 2, 2) // "2 edges between users in Dresden"
(5, 2, 2) // "2 edges from users in Berlin to users in Dresden"
h4. Output graph (summarized on vertex and edge value):
Vertices (id, value, count)
(0, Leipzig, 2)
(2, Dresden, 3)
(5, Berlin, 1)
Edges (source, target, value, count)
(0, 0, 2014, 2) // ...
(0, 2, 2013, 1) // ...
(2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with timestamp 2013"
(2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with timestamp 2015"
(2, 2, 2014, 2) // ...
(5, 2, 2015, 2) // ...
I've already implemented two versions of the summarization algorithm in our own project https://github.com/dbsleipzig/gradoop,
which is a graph analytics stack on top of Hadoop + Gelly/Flink with a fixed data model. You
can see the current WIP here:
1 [Abstract summarizationhttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/Summarization.java]
2 [Implementation using crosshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java]
3 [Implementation using joinshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java]
4 [Testshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/test/java/org/gradoop/model]/impl/EPGraphSummarizeTest.java
5 [TestGraphhttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/devsupport/socialnetwork.pdf]
I would basically use the same implementation as in 3 in combination with KeySelectors to
select the grouping keys on vertices and edges.
As you can see in the example, each vertex in the resulting graph has a vertex id that is
contained in the original graph. This id is the smallest id among the grouped vertices (e.g.,
vertices 2, 3 and 4 represent Dresden, so 2 is the group representative). The latter is necessary
to correctly assign the summarized edges. Maybe there is a smarter way
to do it of which I did not think yet.
I would like contribute this to Flink and of course, if you have any suggestions/improvements
or do not want this at all (hopefully not), please let me know.

This message was sent by Atlassian JIRA
(v6.3.4#6332)
