[ https://issues.apache.org/jira/browse/FLINK2411?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Martin Junghanns updated FLINK2411:

Description:
Graph summarization determines a structural grouping of similar vertices and edges to condense
a graph and thus helps to uncover insights about patterns hidden in the graph. It can be used
in OLAPstyle operations on the graph and is similar to group by in SQL but on the graph structure
instead of rows.
The graph summarization operator represents every vertex group by a single vertex in the summarized
graph; edges between vertices in the summary graph represent a group of edges between the
vertex group members of the original graph. Summarization is defined by specifying grouping
keys for vertices and edges, respectively.
One publication that presents a Map/Reduce based approach is "Pagrol: Parallel graph olap
over largescale attributed graphs", however they precompute the graphcube before it can
be analyzed. With Flink, we can give the user an interactive way of summarizing the graph
and do not need to compute the cube beforehand.
A more complex approach focuses on summarization on graph patterns "SynopSys: Large Graph
Analytics in the SAP HANA Database Through Summarization".
However, I want to start with a simple algorithm that summarizes the graph on vertex and optionally
edge values and additionally stores a count aggregate at summarized vertices/edges.
Consider the following two examples (e.g., social network with users from cities and friendships
with timestamp):
h4. Input graph:
Vertices (id, value):
(0, Leipzig)
(1, Leipzig)
(2, Dresden)
(3, Dresden)
(4, Dresden)
(5, Berlin)
Edges (source, target, value):
(0, 1, 2014)
(1, 0, 2014)
(1, 2, 2013)
(2, 1, 2013)
(2, 3, 2014)
(3, 2, 2014)
(4, 0, 2013)
(4, 1, 2015)
(5, 2, 2015)
(5, 3, 2015)
h4. Output graph (summarized on vertex value):
Vertices (id, value, count)
(0, Leipzig, 2) // "2 users from Leipzig"
(2, Dresden, 3) // "3 users from Dresden"
(5, Berlin, 1) // "1 user from Berlin"
Edges (source, target, count)
(0, 0, 2) // "2 edges between users in Leipzig"
(0, 2, 1) // "1 edge from users in Leipzig to users in Dresden"
(2, 0, 3) // "3 edges from users in Dresden to users in Leipzig"
(2, 2, 2) // "2 edges between users in Dresden"
(5, 2, 2) // "2 edges from users in Berlin to users in Dresden"
h4. Output graph (summarized on vertex and edge value):
Vertices (id, value, count)
(0, Leipzig, 2)
(2, Dresden, 3)
(5, Berlin, 1)
Edges (source, target, value, count)
(0, 0, 2014, 2) // ...
(0, 2, 2013, 1) // ...
(2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with timestamp 2013"
(2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with timestamp 2015"
(2, 2, 2014, 2) // ...
(5, 2, 2015, 2) // ...
I've already implemented two versions of the summarization algorithm in our own project https://github.com/dbsleipzig/gradoop,
which is a graph analytics stack on top of Hadoop + Gelly/Flink with a fixed data model. You
can see the current WIP here:
1 [Abstract summarizationhttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/Summarization.java]
2 [Implementation using crosshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java]
3 [Implementation using joinshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java]
4 [Testshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/test/java/org/gradoop/model]/impl/EPGraphSummarizeTest.java
5 [TestGraphhttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/devsupport/socialnetwork.pdf]
I would basically use the same implementation as in 3 in combination with KeySelectors to
select the grouping keys on vertices and edges.
As you can see in the example, each vertex in the resulting graph has a vertex id that is
contained in the original graph. This id is the smallest id among the grouped vertices (e.g.,
vertices 2, 3 and 4 represent Dresden, so 2 is the group representative). The latter is necessary
to correctly assign the summarized edges. Maybe there is a smarter way
to do it of which I did not think yet.
I would like contribute this to Flink and of course, if you have any suggestions/improvements
or do not want this at all (hopefully not), please let me know.
was:
Graph summarization determines a structural grouping of similar vertices and edges to condense
a graph and thus helps to uncover insights about patterns hidden in the graph. It can be used
in OLAPstyle operations on the graph and is similar to group by in SQL but on the graph structure
instead of rows.
The graph summarization operator represents every vertex group by a single vertex in the summarized
graph; edges between vertices in the summary graph represent a group of edges between the
vertex group members of the original graph. Summarization is defined by specifying grouping
keys for vertices and edges, respectively.
One publication that presents a Map/Reduce based approach is
"Pagrol: Parallel graph olap over largescale attributed graphs", however they precompute
the graphcube before it can be analyzed. With Flink, we can give the user an interactive
way of summarizing the graph and do not need to compute the cube beforehand.
A more complex approach focuses on summarization on graph patterns
"SynopSys: Large Graph Analytics in the SAP HANA Database Through Summarization".
However, I want to start with a simple algorithm that summarizes the graph on vertex and optionally
edge values and additionally stores a count aggregate at summarized vertices/edges.
Consider the following two examples
(e.g., social network with users from cities and friendships with timestamp):
h4. Input graph:
Vertices (id, value):
(0, Leipzig)
(1, Leipzig)
(2, Dresden)
(3, Dresden)
(4, Dresden)
(5, Berlin)
Edges (source, target, value):
(0, 1, 2014)
(1, 0, 2014)
(1, 2, 2013)
(2, 1, 2013)
(2, 3, 2014)
(3, 2, 2014)
(4, 0, 2013)
(4, 1, 2015)
(5, 2, 2015)
(5, 3, 2015)
h4. Output graph (summarized on vertex value):
Vertices (id, value, count)
(0, Leipzig, 2) // "2 users from Leipzig"
(2, Dresden, 3) // "3 users from Dresden"
(5, Berlin, 1) // "1 user from Berlin"
Edges (source, target, count)
(0, 0, 2) // "2 edges between users in Leipzig"
(0, 2, 1) // "1 edge from users in Leipzig to users in Dresden"
(2, 0, 3) // "3 edges from users in Dresden to users in Leipzig"
(2, 2, 2) // "2 edges between users in Dresden"
(5, 2, 2) // "2 edges from users in Berlin to users in Dresden"
h4. Output graph (summarized on vertex and edge value):
Vertices (id, value, count)
(0, Leipzig, 2)
(2, Dresden, 3)
(5, Berlin, 1)
Edges (source, target, value, count)
(0, 0, 2014, 2) // ...
(0, 2, 2013, 1) // ...
(2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with timestamp 2013"
(2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with timestamp 2015"
(2, 2, 2014, 2) // ...
(5, 2, 2015, 2) // ...
I've already implemented two versions of the summarization algorithm in our own project https://github.com/dbsleipzig/gradoop,
which is a graph analytics stack on top of Hadoop + Gelly/Flink with a fixed data model. You
can see the current WIP here:
1 [Abstract summarizationhttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/Summarization.java]
2 [Implementation using crosshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java]
3 [Implementation using joinshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java]
4 [Testshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/test/java/org/gradoop/model]/impl/EPGraphSummarizeTest.java
5 [TestGraphhttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/devsupport/socialnetwork.pdf]
I would basically use the same implementation as in 3 in combination with KeySelectors to
select the grouping keys on vertices and edges.
As you can see in the example, each vertex in the resulting graph has a vertex id that is
contained in the original graph. This id is the smallest id among the grouped vertices (e.g.,
vertices 2, 3 and 4 represent Dresden, so 2 is the group representative). The latter is necessary
to correctly assign the summarized edges. Maybe there is a smarter way
to do it of which I did not think yet.
I would like contribute this to Flink and of course, if you have any suggestions/improvements
or do not want this at all (hopefully not), please let me know.
> Add basic graph summarization algorithm
> 
>
> Key: FLINK2411
> URL: https://issues.apache.org/jira/browse/FLINK2411
> Project: Flink
> Issue Type: New Feature
> Components: Gelly
> Affects Versions: master
> Reporter: Martin Junghanns
> Priority: Minor
>
> Graph summarization determines a structural grouping of similar vertices and edges to
condense a graph and thus helps to uncover insights about patterns hidden in the graph. It
can be used in OLAPstyle operations on the graph and is similar to group by in SQL but on
the graph structure instead of rows.
>
> The graph summarization operator represents every vertex group by a single vertex in
the summarized graph; edges between vertices in the summary graph represent a group of edges
between the vertex group members of the original graph. Summarization is defined by specifying
grouping keys for vertices and edges, respectively.
> One publication that presents a Map/Reduce based approach is "Pagrol: Parallel graph
olap over largescale attributed graphs", however they precompute the graphcube before it
can be analyzed. With Flink, we can give the user an interactive way of summarizing the graph
and do not need to compute the cube beforehand.
> A more complex approach focuses on summarization on graph patterns "SynopSys: Large
Graph Analytics in the SAP HANA Database Through Summarization".
> However, I want to start with a simple algorithm that summarizes the graph on vertex
and optionally edge values and additionally stores a count aggregate at summarized vertices/edges.
> Consider the following two examples (e.g., social network with users from cities and
friendships with timestamp):
>
> h4. Input graph:
>
> Vertices (id, value):
> (0, Leipzig)
> (1, Leipzig)
> (2, Dresden)
> (3, Dresden)
> (4, Dresden)
> (5, Berlin)
> Edges (source, target, value):
> (0, 1, 2014)
> (1, 0, 2014)
> (1, 2, 2013)
> (2, 1, 2013)
> (2, 3, 2014)
> (3, 2, 2014)
> (4, 0, 2013)
> (4, 1, 2015)
> (5, 2, 2015)
> (5, 3, 2015)
> h4. Output graph (summarized on vertex value):
> Vertices (id, value, count)
> (0, Leipzig, 2) // "2 users from Leipzig"
> (2, Dresden, 3) // "3 users from Dresden"
> (5, Berlin, 1) // "1 user from Berlin"
> Edges (source, target, count)
> (0, 0, 2) // "2 edges between users in Leipzig"
> (0, 2, 1) // "1 edge from users in Leipzig to users in Dresden"
> (2, 0, 3) // "3 edges from users in Dresden to users in Leipzig"
> (2, 2, 2) // "2 edges between users in Dresden"
> (5, 2, 2) // "2 edges from users in Berlin to users in Dresden"
> h4. Output graph (summarized on vertex and edge value):
> Vertices (id, value, count)
> (0, Leipzig, 2)
> (2, Dresden, 3)
> (5, Berlin, 1)
> Edges (source, target, value, count)
> (0, 0, 2014, 2) // ...
> (0, 2, 2013, 1) // ...
> (2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with timestamp
2013"
> (2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with timestamp 2015"
> (2, 2, 2014, 2) // ...
> (5, 2, 2015, 2) // ...
> I've already implemented two versions of the summarization algorithm in our own project
https://github.com/dbsleipzig/gradoop,
> which is a graph analytics stack on top of Hadoop + Gelly/Flink with a fixed data model.
You can see the current WIP here:
> 1 [Abstract summarizationhttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/Summarization.java]
> 2 [Implementation using crosshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java]
> 3 [Implementation using joinshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java]
> 4 [Testshttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/gradoopflink/src/test/java/org/gradoop/model]/impl/EPGraphSummarizeTest.java
> 5 [TestGraphhttps://github.com/dbsleipzig/gradoop/blob/%2345_gradoop_flink/devsupport/socialnetwork.pdf]
> I would basically use the same implementation as in 3 in combination with KeySelectors
to select the grouping keys on vertices and edges.
> As you can see in the example, each vertex in the resulting graph has a vertex id that
is contained in the original graph. This id is the smallest id among the grouped vertices
(e.g., vertices 2, 3 and 4 represent Dresden, so 2 is the group representative). The latter
is necessary to correctly assign the summarized edges. Maybe there is a smarter way
> to do it of which I did not think yet.
> I would like contribute this to Flink and of course, if you have any suggestions/improvements
or do not want this at all (hopefully not), please let me know.

This message was sent by Atlassian JIRA
(v6.3.4#6332)
