flink-dev mailing list archives

From Katherin Eri <katherinm...@gmail.com>
Subject Re: [DISCUSS] Flink ML roadmap
Date Thu, 23 Feb 2017 11:34:12 GMT
I'm not sure that this is feasible: doing everything at the same time could
mean doing nothing. :(
I'm just afraid that statements like "we will work on streaming, not on
batch" and "we have no committer time for this" mean that yes, we started
work on FLINK-1730, but nobody will commit this work in the end, as has
already happened with this ticket.

On 23 Feb 2017 at 14:26, "Gábor Hermann" <mail@gaborhermann.com> wrote:

> @Theodore: Great to hear you think the "batch on streaming" approach is
> possible! Of course, we need to pay attention to all the pitfalls there, if
> we go that way.
> +1 for a design doc!
> I would add that it's possible to make efforts in all three directions
> (i.e. batch, online, batch on streaming) at the same time. Although, it
> might be worth concentrating on one. E.g. it would not be so useful to
> have the same batch algorithms with both the batch API and streaming API.
> We can decide later.
> The design doc could be partitioned into these 3 directions, and we can
> collect the pros/cons there too. What do you think?
> Cheers,
> Gabor
> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>> Hello all,
>> @Gabor, we have discussed the idea of using the streaming API to write all
>> of our ML algorithms with a couple of people offline,
>> and I think it might be possible and is generally worth a shot. The
>> approach we would take would be close to Vowpal Wabbit, not exactly
>> "online", but rather "fast-batch".
>> There will be problems popping up again, even for very simple algos like
>> online linear regression with SGD [1], but hopefully fixing those will be
>> more aligned with the priorities of the community.
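
To make the "fast-batch" idea concrete, here is a minimal single-machine sketch of online linear regression with SGD: the model is updated once per incoming record, in a single pass, without ever materializing the full dataset. It does not use any Flink API; the function names, the synthetic data generator and the learning rate are all assumptions made for illustration.

```python
import random


def sgd_step(weights, bias, features, target, learning_rate):
    """Apply one SGD step for squared error on a single record."""
    prediction = sum(w * x for w, x in zip(weights, features)) + bias
    error = prediction - target
    new_weights = [w - learning_rate * error * x
                   for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


def train_on_stream(records, n_features, learning_rate=0.05):
    """Consume (features, target) records one by one, in a single pass."""
    weights, bias = [0.0] * n_features, 0.0
    for features, target in records:
        weights, bias = sgd_step(weights, bias, features, target,
                                 learning_rate)
    return weights, bias


def synthetic_stream(n_records, seed=0):
    """Hypothetical noise-free source: y = 2*x0 - 3*x1 + 0.5."""
    rng = random.Random(seed)
    for _ in range(n_records):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 2.0 * x[0] - 3.0 * x[1] + 0.5


weights, bias = train_on_stream(synthetic_stream(5000), n_features=2)
```

In a distributed streaming job the hard part is not this update rule but how the model state is shared and synchronized between parallel instances, which is the kind of problem the thread linked as [1] runs into.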
>> @Katherin, my understanding is that given the limited resources, there is
>> no development effort focused on batch processing right now.
>> So to summarize, it seems like there are people willing to work on ML on
>> Flink, but nobody is sure how to do it.
>> There are many directions we could take (batch, online, batch on
>> streaming), each with its own merits and downsides.
>> If you want, we can start a design doc and move the conversation there,
>> come up with a roadmap and start implementing.
>> Regards,
>> Theodore
>> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <mail@gaborhermann.com> wrote:
>>> It's great to see so much activity in this discussion :)
>>> I'll try to add my thoughts.
>>> I think building a developer community (Till's 2nd point) can be slightly
>>> separated from what features we should aim for (1st point) and showcasing
>>> (3rd point). Thanks Till for bringing up the ideas for restructuring, I'm
>>> sure we'll find a way to make the development process more dynamic. I'll
>>> try to address the rest here.
>>> It's hard to choose a direction between streaming and batch ML. As Theo has
>>> indicated, not much online ML is used in production, but Flink concentrates
>>> on streaming, so online ML would be a better fit for Flink. However, as
>>> most of you argued, there's a definite need for batch ML. But batch ML seems
>>> hard to achieve because there are blocking issues with persisting,
>>> iteration paths etc. So it's no good either way.
>>> I propose a seemingly crazy solution: what if we developed batch
>>> algorithms also with the streaming API? The batch API would clearly seem
>>> more suitable for ML algorithms, but there are a lot of benefits of this
>>> approach too, so it's clearly worth considering. Flink also has the high
>>> level vision of "streaming for everything" that would clearly fit this
>>> case. What do you all think about this? Do you think this solution would be
>>> feasible? I would be happy to make a more elaborate proposal, but I'll put
>>> my main ideas here:
>>> 1) Simplifying by using one system
>>> It could simplify the work of both the users and the developers. One could
>>> execute training once, or could execute it periodically e.g. by using
>>> windows. Low-latency serving and training could be done in the same system.
>>> We could implement incremental algorithms, without any side inputs for
>>> combining online learning (or predictions) with batch learning. Of course,
>>> all the logic describing these must be somehow implemented (e.g.
>>> synchronizing predictions with training), but it should be easier to do so
>>> in one system than by combining e.g. the batch and streaming API.
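
As a rough illustration of "executing training periodically by using windows", the sketch below groups a stream into tumbling count windows and runs a batch-style trainer on each window. It is plain Python rather than the Flink windowing API, and `tumbling_windows`, `fit_mean` and `train_periodically` are hypothetical names; the "model" is just the mean of the window, standing in for any batch algorithm.

```python
def tumbling_windows(stream, size):
    """Group a stream into consecutive, non-overlapping windows of `size`."""
    window = []
    for record in stream:
        window.append(record)
        if len(window) == size:
            yield window
            window = []
    if window:
        yield window  # emit the final, possibly partial, window


def fit_mean(window):
    """Stand-in batch trainer: the 'model' is the mean of the window."""
    return sum(window) / len(window)


def train_periodically(stream, size):
    """Retrain from scratch on every window and collect the models."""
    return [fit_mean(window) for window in tumbling_windows(stream, size)]


models = train_periodically([1, 2, 3, 4, 5, 6, 7], size=3)
# models == [2.0, 5.0, 7.0]
```

Running training once is the special case of a single window covering the whole input.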
>>> 2) Batch ML with the streaming API is not harder
>>> Despite these benefits, it could seem harder to implement batch ML with
>>> the streaming API, but in my opinion it's not. There are more flexible,
>>> lower-level optimization potentials with the streaming API. Most
>>> distributed ML algorithms use a lower-level model than the batch API
>>> anyway, so sometimes it feels like forcing the algorithm logic into the
>>> training API and tweaking it. Although we could not use the batch
>>> primitives like join, we would have the E.g. in my experience with
>>> implementing a distributed matrix factorization algorithm [1], I couldn't
>>> do a simple optimization because of the limitations of the iteration API
>>> [2]. Even if we pushed all the development effort to make the batch API
>>> more suitable for ML there would be things we couldn't do. E.g. there are
>>> approaches for updating a model iteratively without locks [3,4] (i.e.
>>> somewhat asynchronously), and I don't see a clear way to implement such
>>> algorithms with the batch API.
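
The lock-free approach mentioned above can be sketched on a single machine as follows. This is only an illustration in the spirit of [3], assuming a tiny noise-free linear regression problem: several threads read and update one shared weight vector with no locking, so reads may be stale and writes may interleave. The names `worker` and `train_lock_free`, the data and the hyperparameters are all made up for the example.

```python
import random
import threading

TRUE_WEIGHTS = [2.0, -3.0]  # hypothetical target model used to generate data


def worker(shared_weights, n_steps, learning_rate, seed):
    """Run SGD steps against the shared model without taking any lock."""
    rng = random.Random(seed)  # per-thread generator, no shared RNG state
    for _ in range(n_steps):
        x = [rng.uniform(-1, 1) for _ in TRUE_WEIGHTS]
        y = sum(w * xi for w, xi in zip(TRUE_WEIGHTS, x))
        # Unsynchronized read: other threads may be writing concurrently.
        error = sum(w * xi for w, xi in zip(shared_weights, x)) - y
        # Unsynchronized per-coordinate write.
        for i, xi in enumerate(x):
            shared_weights[i] -= learning_rate * error * xi


def train_lock_free(n_threads=4, n_steps=3000, learning_rate=0.1):
    """Start the workers on one shared weight list and wait for them."""
    shared_weights = [0.0] * len(TRUE_WEIGHTS)
    threads = [
        threading.Thread(target=worker,
                         args=(shared_weights, n_steps, learning_rate, seed))
        for seed in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared_weights


weights = train_lock_free()
```

The result is not bit-for-bit reproducible across runs, which is exactly the property that makes this style awkward to express with synchronized, superstep-based iterations.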
>>> 3) Streaming community (users and devs) benefit
>>> The Flink streaming community in general would also benefit from this
>>> direction. There are many features needed in the streaming API for ML to
>>> work, but this is also true for the batch API. One really important one is
>>> the loops API (a.k.a. iterative DataStreams) [5]. There has been a lot of
>>> effort (mostly from Paris) for making it mature enough [6]. Kate mentioned
>>> using GPUs, and I'm sure they have uses in streaming generally [7]. Thus,
>>> by improving the streaming API to allow ML algorithms, the streaming API
>>> benefits too (which is important as it has a lot more production users
>>> than the batch API).
>>> 4) Performance can be at least as good
>>> I believe the same performance could be achieved with the streaming API as
>>> with the batch API. The streaming API is much closer to the runtime than the
>>> batch API. For corner cases with runtime-layer optimizations of the batch
>>> API, we could find a way to do the same (or similar) optimization for the
>>> streaming API (see my previous point). One such case could be using managed
>>> memory (and spilling to disk). There are also benefits by default, e.g. we
>>> would have finer-grained fault tolerance with the streaming API.
>>> 5) We could keep batch ML API
>>> For the shorter term, we should not throw away all the algorithms
>>> implemented with the batch API. By pushing forward the development with
>>> side inputs we could make them usable with the streaming API. Then, if the
>>> library gains some popularity, we could replace the algorithms in the batch
>>> API with streaming ones, to avoid the performance costs of e.g. not being
>>> able to persist.
>>> 6) General tools for implementing ML algorithms
>>> Besides implementing algorithms one by one, we could give more general
>>> tools for making it easier to implement algorithms, e.g. a parameter server
>>> [8,9]. Theo also mentioned in another thread that TensorFlow has a similar
>>> model to Flink streaming; we could look into that too. I think often when
>>> deploying a production ML system, much more configuration and tweaking
>>> should be done than e.g. Spark MLlib allows. Why not allow that?
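
For readers unfamiliar with the parameter server idea from [8], here is a minimal in-process sketch of its pull/push interface: workers pull only the parameters their example touches and push back deltas. It is not the design discussed in [9] and involves no networking or Flink state; the class and function names and the toy data are assumptions for illustration.

```python
from collections import defaultdict


class ParameterServer:
    """Holds model parameters by key; workers pull values and push deltas."""

    def __init__(self, default=0.0):
        self._params = defaultdict(lambda: default)

    def pull(self, keys):
        """Return the current values of the requested keys."""
        return {key: self._params[key] for key in keys}

    def push(self, deltas):
        """Add each delta to the stored parameter with the same key."""
        for key, delta in deltas.items():
            self._params[key] += delta


def worker_step(server, features, target, learning_rate=0.1):
    """One SGD step on a sparse example, touching only the keys it uses."""
    weights = server.pull(features.keys())
    error = sum(weights[k] * v for k, v in features.items()) - target
    server.push({k: -learning_rate * error * v for k, v in features.items()})


server = ParameterServer()
for _ in range(200):
    worker_step(server, {"a": 1.0}, 2.0)   # examples that only touch "a"
    worker_step(server, {"b": 1.0}, -1.0)  # examples that only touch "b"
learned = server.pull(["a", "b"])
```

Because each worker only exchanges the keys it needs, the same interface carries over to models that are too large to broadcast in full.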
>>> 7) Showcasing
>>> Showcasing this could be easier. We could say that we're doing batch ML
>>> with a streaming API. That's interesting on its own. IMHO this integration
>>> is also a more approachable way towards end-to-end ML.
>>> Thanks for reading so far :)
>>> [1] https://github.com/apache/flink/pull/2819
>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
>>> [6] https://github.com/apache/flink/pull/1668
>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
>>> Cheers,
>>> Gabor
