flink-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject a typical ML algorithm flow
Date Tue, 22 Mar 2016 20:56:37 GMT

probably more of a question for Till:

Imagine a common ML algorithm flow that runs until convergence.

A typical distributed flow would look something like this (e.g., GMM EM follows
exactly this pattern):

A: input

do {

   stat1 = A.map.reduce
   A = A.update-map(stat1)
   conv = A.map.reduce
} until conv < convThreshold
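To make the loop concrete, here is a minimal local sketch of that flow, instantiated as a tiny 1-D hard-assignment clustering update (an EM-style iteration). The names `em_style_loop`, `points`, and `mu` are illustrative only, not any FlinkML or Mahout API; each phase is labeled with the pseudocode step it corresponds to.

```python
# Sketch of the do/until flow above as a local, single-process loop.
# "A" is the dataset; the model state mu plays the role of the
# per-iteration statistics. Hypothetical names, not a real framework API.

def em_style_loop(points, mu, threshold=1e-6, max_iters=100):
    for _ in range(max_iters):
        # stat1 = A.map.reduce: assign each point to its nearest center
        # (map), then accumulate per-center sums and counts (reduce).
        sums = [0.0] * len(mu)
        counts = [0] * len(mu)
        for x in points:
            k = min(range(len(mu)), key=lambda i: abs(x - mu[i]))
            sums[k] += x
            counts[k] += 1
        # A = A.update-map(stat1): recompute the model from the statistics.
        new_mu = [s / c if c else m for s, c, m in zip(sums, counts, mu)]
        # conv = A.map.reduce: convergence statistic (largest center move).
        conv = max(abs(a - b) for a, b in zip(mu, new_mu))
        mu = new_mu
        if conv < threshold:  # stop once the change is small enough
            break
    return mu
```

For example, `em_style_loop([0.9, 1.1, 4.9, 5.1], [0.0, 6.0])` converges to centers `[1.0, 5.0]` in two passes. The key structural point carries over to the distributed case: every iteration makes two full passes over the same dataset A.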

There could probably be a single map-reduce step originating on A that computes
both the convergence-criterion statistics and the update statistics in one pass.
The point is that the update and the map.reduce both originate on the same dataset A.

In Spark we would normally commit A to an object tree cache so that the data is
available to subsequent map passes without any I/O or serialization
operations, thus ensuring a high rate of iteration.

We observe the same pattern pretty much everywhere: clustering,
probabilistic algorithms, even batch gradient descent or quasi-Newton
fitting.

How do we do something like that, for example, in FlinkML?



