Hi,
probably more of a question for Till:
Imagine a common ML algorithm flow that runs until convergence.
A typical distributed flow would look something like this (e.g. GMM EM
follows exactly this shape):
A: input
do {
  stat1 = A.map.reduce
  A = A.updatemap(stat1)
  conv = A.map.reduce
} until conv > convThreshold
There could probably be a single map/reduce pass over A that computes both
the convergence-criterion statistics and the update statistics in one step,
but that is not the point.
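For concreteness, here is a minimal single-machine sketch of that loop for
1-D, 2-component GMM EM (plain Python; an in-memory list stands in for the
cached distributed dataset, and all names are illustrative, not any actual
framework API). One pass over the data yields both the update statistics and
the convergence statistic (log-likelihood):

```python
import math
import random

def em_gmm_1d(data, n_iters=100, tol=1e-6):
    """Toy 1-D, 2-component GMM fit via EM. Each iteration is one
    'map.reduce' pass over the cached data that produces both the
    sufficient statistics for the update and the convergence statistic."""
    n = len(data)
    # crude init: means at the data extremes, unit variances, equal weights
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    prev_ll = -float("inf")
    for _ in range(n_iters):
        # "map.reduce": accumulate per-component statistics + log-likelihood
        Nk = [0.0, 0.0]
        Sx = [0.0, 0.0]
        Sxx = [0.0, 0.0]
        ll = 0.0
        for x in data:
            p = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            tot = p[0] + p[1]
            ll += math.log(tot)
            for k in range(2):
                r = p[k] / tot          # responsibility of component k for x
                Nk[k] += r
                Sx[k] += r * x
                Sxx[k] += r * x * x
        # "updatemap": refresh the model parameters from the statistics
        for k in range(2):
            mu[k] = Sx[k] / Nk[k]
            var[k] = max(Sxx[k] / Nk[k] - mu[k] ** 2, 1e-6)
            pi[k] = Nk[k] / n
        # convergence check comes out of the very same pass
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return mu, var, pi

# usage: two well-separated clusters, recovered means near 0 and 5
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(5.0, 1.0) for _ in range(200)])
mu, var, pi = em_gmm_1d(data)
```

The inner loop plays the role of stat1 = A.map.reduce, the parameter
refresh plays updatemap, and the log-likelihood delta is the conv test.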
The point is that the update and the map/reduce passes repeatedly originate
from the same dataset.
In Spark we would normally commit A to an in-memory object cache so that the
data is available to subsequent map passes without any I/O or serialization,
thus ensuring a high rate of iterations.
We observe the same pattern pretty much everywhere: clustering,
probabilistic algorithms, even batch gradient descent or quasi-Newton
fitting.
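The batch gradient descent case follows the exact same skeleton; a minimal
sketch (again plain Python over an in-memory list standing in for the cached
dataset, purely illustrative) for fitting a single coefficient theta of
y = theta * x by least squares:

```python
def batch_gd(xs, ys, lr=0.01, tol=1e-9, max_iters=10000):
    """Same loop shape: each iteration is one map/reduce pass over the
    cached data, producing both the update (the gradient) and the
    convergence statistic (the gradient magnitude)."""
    n = len(xs)
    theta = 0.0
    for _ in range(max_iters):
        # "map.reduce": accumulate the mean gradient of the squared loss
        grad = sum(2.0 * (theta * x - y) * x for x, y in zip(xs, ys)) / n
        theta -= lr * grad            # "updatemap": apply the step
        if abs(grad) < tol:           # convergence test from the same pass
            break
    return theta

# usage: data generated as y = 3x, so the fit should recover theta near 3
xs = [i / 100.0 for i in range(1, 101)]
ys = [3.0 * x for x in xs]
theta = batch_gd(xs, ys)
```

Without keeping xs/ys resident between iterations, every one of those
gradient passes would pay the full I/O and deserialization cost again.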
How do we do something like that, for example, in FlinkML?
Thoughts?
Thanks,
Dmitriy
