Hi Kant,
As far as I know, the current example connected-components implementation
based on the DataSet API cannot be extended to streaming data or incremental batch directly.
From the algorithm's perspective, if the graph only ever adds edges and never removes
them, the connected components should be updatable incrementally when the graph changes:
when some edges are added, a new search is started from the endpoints of the added edges
to propagate their component IDs. This triggers a new pass of updates over the downstream
vertices, and the updates continue until no vertex's component ID changes any more.
However, if there are also edge removals, I think incremental computation cannot be
easily achieved, since a removal may split a component and force a recomputation.
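To make the add-only case concrete, here is a minimal, framework-free Java sketch of that propagation logic (the class and method names are made up for illustration, not part of any Flink API): new edges seed a frontier, and the smaller component ID is propagated until nothing changes, so untouched parts of the graph are never revisited.

```java
import java.util.*;

public class IncrementalCC {

    // adj: adjacency lists of the undirected graph; componentId: vertex -> current component ID.
    // newEdges are added to the graph, and propagation starts only from their endpoints.
    static void addEdgesAndPropagate(Map<Integer, List<Integer>> adj,
                                     Map<Integer, Integer> componentId,
                                     List<int[]> newEdges) {
        Deque<Integer> frontier = new ArrayDeque<>();
        for (int[] e : newEdges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
            componentId.putIfAbsent(e[0], e[0]);   // unseen vertices start in their own component
            componentId.putIfAbsent(e[1], e[1]);
            frontier.add(e[0]);
            frontier.add(e[1]);
        }
        // Propagate the minimum component ID outward from the new edges' endpoints.
        while (!frontier.isEmpty()) {
            int v = frontier.poll();
            int cid = componentId.get(v);
            for (int n : adj.getOrDefault(v, Collections.emptyList())) {
                if (componentId.getOrDefault(n, Integer.MAX_VALUE) > cid) {
                    componentId.put(n, cid);
                    frontier.add(n);   // neighbor changed, keep propagating from it
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> adj = new HashMap<>();
        Map<Integer, Integer> cid = new HashMap<>();
        // initial graph: two components {1,2} and {3,4}
        addEdgesAndPropagate(adj, cid, Arrays.asList(new int[]{1, 2}, new int[]{3, 4}));
        System.out.println(cid);   // {1=1, 2=1, 3=3, 4=3}
        // adding edge 2-3 merges the components; only affected vertices are updated
        addEdgesAndPropagate(adj, cid, Collections.singletonList(new int[]{2, 3}));
        System.out.println(cid);   // {1=1, 2=1, 3=1, 4=1}
    }
}
```

Note that this sketch cannot handle edge removals, for the reason above: deleting an edge might split a component, and nothing in the propagated IDs tells you whether an alternative path still exists.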
To implement the above logic on Flink, I think there are currently two
possible methods:
1) Use the DataSet API with DataSet iterations: maintain the graph structure
and the latest computation result in external storage, and whenever enough changes to the
graph have accumulated, submit a new DataSet job to recompute the result. The job should load,
for each vertex, its edges, its latest component ID, and a flag indicating whether it is an
endpoint of a newly added edge, and then run the incremental computation described above.
2) Flink also provides a DataStream iteration API [1] that enables iterating
on unbounded data. In this case the graph modifications should be modeled as a DataStream,
and operators inside the iteration should maintain the graph structure and the current component
IDs. Whenever enough changes have accumulated, they start a new pass of computation.
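For option 2, the essential part is the state such an operator would keep between events. Below is a framework-free Java sketch of that state, assuming edge additions arrive one at a time; `StreamingCCState` and `onEdgeAdded` are hypothetical names for illustration, not actual Flink operator APIs. The method returns the vertices whose component ID changed, which is roughly what the operator would emit (and potentially feed back through the iteration).

```java
import java.util.*;

// Hypothetical sketch of per-operator state for the DataStream-iteration approach:
// consume one edge-addition event at a time, keep the adjacency lists and current
// component IDs as state, and emit only the vertices whose component ID changed.
public class StreamingCCState {
    private final Map<Integer, List<Integer>> adj = new HashMap<>();
    private final Map<Integer, Integer> componentId = new HashMap<>();

    public Map<Integer, Integer> onEdgeAdded(int src, int dst) {
        adj.computeIfAbsent(src, k -> new ArrayList<>()).add(dst);
        adj.computeIfAbsent(dst, k -> new ArrayList<>()).add(src);
        componentId.putIfAbsent(src, src);   // unseen vertices start in their own component
        componentId.putIfAbsent(dst, dst);

        Map<Integer, Integer> changed = new HashMap<>();
        Deque<Integer> frontier = new ArrayDeque<>(Arrays.asList(src, dst));
        while (!frontier.isEmpty()) {
            int v = frontier.poll();
            int cid = componentId.get(v);
            for (int n : adj.get(v)) {
                if (componentId.get(n) > cid) {
                    componentId.put(n, cid);
                    changed.put(n, cid);     // this vertex update would be emitted downstream
                    frontier.add(n);
                }
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        StreamingCCState state = new StreamingCCState();
        state.onEdgeAdded(1, 2);
        state.onEdgeAdded(3, 4);
        // the merging edge 2-3 only touches the vertices it actually affects
        System.out.println(state.onEdgeAdded(2, 3));   // {3=1, 4=1}
    }
}
```

In a real job this state would have to live in Flink's managed keyed state (partitioned by vertex) rather than in a single in-memory map, so that it is checkpointed and scales beyond one machine; this sketch only shows the update logic.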
Best,
Yun
[1] Flink DataStream iteration, https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/datastream_api.html#iterations

From: kant kodali <kanth909@gmail.com>
Send Time: 2020 Feb. 18 (Tue.) 15:11
To: user <user@flink.apache.org>
Subject: Can Connected Components run on a streaming dataset using iterate delta?
Hi All,
I am wondering if connected components can run on streaming data, or say incremental batch?
I see that with delta iteration not all vertices need to participate in every iteration, which
is great, but in my case the graph is evolving over time; in other words, new edges are getting
added over time. If so, does the algorithm need to run on the entire graph, or can it simply
run on the new batch of edges?
Finally, what is the best way to compute connected components on graphs evolving over time?
Should I use streaming, batch, or some custom incremental approach? Also, the documentation
takes maxIterations as an input. How can I come up with a good number for max iterations? And
once I come up with a good number for max iterations, is the algorithm guaranteed to converge?
Thanks,
Kant
