Hello beautiful Flink people,
during the past few days, Andra and I have been discussing about how to
extend Gelly's iteration methods.
Alexander's course (and his awesome students) has made it obvious that
vertexcentric iterations are not the best fit for algorithms which don't
follow the common "propagateupdate" pattern. For example, Andra is working
on an implementation of Minimum Spanning Tree, which requires branching
inside an iteration and also requires a convergence check of an internal
iteration. Others also reported similar issues [1, 2]. Trying to fit such
algorithms to the vertexcentric model leads to long and ugly code, e.g.
aggregators to keep track of algorithm phases, duplicating data, etc.
One limitation of the vertexcentric and the upcoming GAS model is that
they both only allow the vertex values to be updated in each iteration.
However, for some algorithms we need to update the edge values and in
others we need to update both. In even more complex situations (like
Andra's MST) in some iterations we need to update the vertex values and in
some iterations we need to update the edge values.
Another problem is that we currently don't have a way to allow different
computational phases inside an iteration. This is something that Giraph
solves with master compute, a function that is executed once before each
superstep and sets the computation function.
All that said, I believe that we can solve most of these issues if we
nicely expose Flink's iteration operators in Gelly. I can see the following
cases:
1. Bulk & delta iterations where the solution set is the vertex dataset:
this will be similar to vertexcentric and GAS, but will allow more
flexible dataflows inside the iteration.
2. Bulk & delta iterations where the solution set is the edge dataset: for
the cases where we need to update edge values.
3. Bulk & delta iterations where the solution set is the Graph: this will
cover more complex cases, where the algorithm updates both vertices and
edges or even adds/removes vertices/edges, i.e. updates the whole Graph.
What do you think? I can see 1 & 2 being very easy to implement, but I
suspect 3 won't be that easy (but so awesome to have ^^).
Would it work the way a Graph is represented now, i.e. with 2 DataSets?
Any comment, idea, pointer would be much appreciated! Thank you ^^
Cheers,
V.
[1]:
http://apacheflinkincubatorusermailinglistarchive.2336050.n4.nabble.com/CanamasterclasscontrolthesuperstepinFlinkSpargeltd733.html
[2]:
http://issues.apache.org/jira/browse/FLINK1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel#comment14325769
