flink-dev mailing list archives

From Gábor Hermann <m...@gaborhermann.com>
Subject Re: [DISCUSS] Flink ML roadmap
Date Thu, 23 Feb 2017 12:05:38 GMT
I agree that it's better to go in one direction first, but I think 
online and offline ML with the streaming API can go somewhat in parallel 
later. We could set a short-term goal, concentrate initially on one 
direction, and showcase that direction (e.g. in a blog post). But first, 
we should list the pros/cons in a design doc as a minimum, then make a 
decision on which direction to go. Would that be feasible?

On 2017-02-23 12:34, Katherin Eri wrote:

> I'm not sure that this is feasible; doing everything at the same time could
> mean doing nothing ((((
> I'm just afraid that the words "we will work on streaming, not on batch; we
> have no committer time for this" mean that yes, we start work on
> FLINK-1730, but nobody will commit this work in the end, as has already
> happened with this ticket.
>
> On 23 Feb 2017 at 14:26, "Gábor Hermann" <mail@gaborhermann.com>
> wrote:
>
>> @Theodore: Great to hear you think the "batch on streaming" approach is
>> possible! Of course, we need to pay attention to all the pitfalls there, if
>> we go that way.
>>
>> +1 for a design doc!
>>
>> I would add that it's possible to make efforts in all three directions
>> (i.e. batch, online, batch on streaming) at the same time, although it
>> might be worth concentrating on one. E.g. it would not be so useful to
>> have the same batch algorithms with both the batch API and the streaming
>> API. We can decide later.
>>
>> The design doc could be partitioned into these three directions, and we
>> could collect the pros/cons there too. What do you think?
>>
>> Cheers,
>> Gabor
>>
>>
>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>
>>> Hello all,
>>>
>>>
>>> @Gabor, a couple of us have discussed offline the idea of using the
>>> streaming API to write all of our ML algorithms, and I think it might be
>>> possible and is generally worth a shot. The approach we would take would
>>> be close to Vowpal Wabbit: not exactly "online", but rather "fast-batch".
>>>
>>> There will be problems popping up again, even for very simple algos like
>>> online linear regression with SGD [1], but hopefully fixing those will be
>>> more aligned with the priorities of the community.
>>>
>>> @Katherin, my understanding is that given the limited resources, there is
>>> no development effort focused on batch processing right now.
>>>
>>> So to summarize, it seems like there are people willing to work on ML on
>>> Flink, but nobody is sure how to do it.
>>> There are many directions we could take (batch, online, batch on
>>> streaming), each with its own merits and downsides.
>>>
>>> If you want, we can start a design doc, move the conversation there, come
>>> up with a roadmap, and start implementing.
>>>
>>> Regards,
>>> Theodore
>>>
>>> [1]
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
>>>
>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <mail@gaborhermann.com>
>>> wrote:
>>>
>>>> It's great to see so much activity in this discussion :)
>>>> I'll try to add my thoughts.
>>>>
>>>> I think building a developer community (Till's second point) can be
>>>> slightly separated from what features we should aim for (his first point)
>>>> and showcasing (his third point). Thanks Till for bringing up the ideas
>>>> for restructuring; I'm sure we'll find a way to make the development
>>>> process more dynamic. I'll try to address the rest here.
>>>>
>>>> It's hard to choose a direction between streaming and batch ML. As Theo
>>>> has indicated, not much online ML is used in production, but Flink
>>>> concentrates on streaming, so online ML would be a better fit for Flink.
>>>> However, as most of you argued, there's a definite need for batch ML. But
>>>> batch ML seems hard to achieve because there are blocking issues with
>>>> persisting, iteration paths, etc. So neither direction is an obvious
>>>> choice.
>>>>
>>>> I propose a seemingly crazy solution: what if we developed batch
>>>> algorithms with the streaming API as well? The batch API might seem
>>>> clearly more suitable for ML algorithms, but there are a lot of benefits
>>>> to this approach too, so it's worth considering. Flink also has the
>>>> high-level vision of "streaming for everything", which this case would
>>>> clearly fit. What do you all think about this? Do you think this solution
>>>> would be feasible? I would be happy to make a more elaborate proposal,
>>>> but let me push my main ideas here:
>>>>
>>>> 1) Simplifying by using one system
>>>> It could simplify the work of both users and developers. One could
>>>> execute training once, or execute it periodically, e.g. by using windows.
>>>> Low-latency serving and training could be done in the same system. We
>>>> could combine online learning (or predictions) with batch learning using
>>>> incremental algorithms, without any need for side inputs. Of course, all
>>>> the logic describing this must still be implemented somehow (e.g.
>>>> synchronizing predictions with training), but it should be easier to do
>>>> in one system than by combining e.g. the batch and streaming APIs.
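>>>>
>>>> To make this a bit more concrete, here is a rough sketch with the current
>>>> Scala streaming API (toy types and a toy "trainer", just to illustrate the
>>>> idea, not a proposal for an actual API): periodic training on windows and
>>>> low-latency serving living in the same job, with the trained model shipped
>>>> to the serving operator via a connected stream.
>>>>
>>>>   import org.apache.flink.streaming.api.scala._
>>>>   import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
>>>>   import org.apache.flink.streaming.api.windowing.time.Time
>>>>   import org.apache.flink.streaming.api.windowing.windows.TimeWindow
>>>>   import org.apache.flink.util.Collector
>>>>
>>>>   case class LabeledPoint(x: Double, y: Double)
>>>>   case class Model(slope: Double)
>>>>
>>>>   // Serving operator: keeps the latest trained model and scores queries.
>>>>   class ServingFunction extends CoFlatMapFunction[Double, Model, Double] {
>>>>     private var model = Model(0.0)
>>>>     override def flatMap1(x: Double, out: Collector[Double]): Unit =
>>>>       out.collect(model.slope * x)   // low-latency prediction
>>>>     override def flatMap2(m: Model, out: Collector[Double]): Unit =
>>>>       model = m                      // swap in the freshly trained model
>>>>   }
>>>>
>>>>   val env = StreamExecutionEnvironment.getExecutionEnvironment
>>>>   val trainingData: DataStream[LabeledPoint] = ??? // e.g. a Kafka source
>>>>   val queries: DataStream[Double] = ???            // points to score
>>>>
>>>>   // "Fast-batch" training: refit a toy model on every 10-minute window.
>>>>   val models: DataStream[Model] = trainingData
>>>>     .timeWindowAll(Time.minutes(10))
>>>>     .apply { (_: TimeWindow, pts: Iterable[LabeledPoint],
>>>>               out: Collector[Model]) =>
>>>>       out.collect(Model(pts.map(p => p.y / p.x).sum / pts.size))
>>>>     }
>>>>
>>>>   // Training and serving are parts of the same streaming job.
>>>>   queries.connect(models.broadcast).flatMap(new ServingFunction).print()
>>>>   env.execute("periodic training + serving sketch")
>>>>
>>>> The same structure would work for incremental updates (emitting model
>>>> deltas instead of full models), and side inputs would not be needed.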
>>>>
>>>> 2) Batch ML with the streaming API is not harder
>>>> Despite these benefits, it could seem harder to implement batch ML with
>>>> the streaming API, but in my opinion it's not. There is more flexible,
>>>> lower-level optimization potential with the streaming API. Most
>>>> distributed ML algorithms use a lower-level model than the batch API
>>>> anyway, so sometimes it feels like forcing the algorithm logic into the
>>>> batch API and tweaking it (although with the streaming API we could not
>>>> use the batch primitives like join). E.g. in my experience with
>>>> implementing a distributed matrix factorization algorithm [1], I couldn't
>>>> do a simple optimization because of the limitations of the iteration API
>>>> [2]. Even if we pushed all the development effort towards making the
>>>> batch API more suitable for ML, there would still be things we couldn't
>>>> do. E.g. there are approaches for updating a model iteratively without
>>>> locks [3,4] (i.e. somewhat asynchronously), and I don't see a clear way
>>>> to implement such algorithms with the batch API.
>>>>
>>>> 3) The streaming community (users and devs) benefits
>>>> The Flink streaming community in general would also benefit from this
>>>> direction. There are many features needed in the streaming API for ML to
>>>> work, but this is also true for the batch API. One really important one
>>>> is the loops API (a.k.a. iterative DataStreams) [5]. There has been a lot
>>>> of effort (mostly from Paris) towards making it mature enough [6]. Kate
>>>> mentioned using GPUs, and I'm sure they have uses in streaming generally
>>>> [7]. Thus, by improving the streaming API to allow ML algorithms, the
>>>> streaming API would benefit too (which is important, as it has a lot more
>>>> production users than the batch API).
>>>>
>>>> 4) Performance can be at least as good
>>>> I believe the same performance could be achieved with the streaming API
>>>> as with the batch API. The streaming API is much closer to the runtime
>>>> than the batch API. For corner cases where the batch API has
>>>> runtime-layer optimizations, we could find a way to do the same (or a
>>>> similar) optimization for the streaming API (see my previous point). One
>>>> such case could be using managed memory (and spilling to disk). There are
>>>> also benefits by default, e.g. we would have finer-grained fault
>>>> tolerance with the streaming API.
>>>>
>>>> 5) We could keep the batch ML API
>>>> For the shorter term, we should not throw away all the algorithms
>>>> implemented with the batch API. By pushing forward the development of
>>>> side inputs, we could make them usable with the streaming API. Then, if
>>>> the library gains some popularity, we could replace the batch API
>>>> algorithms with streaming ones, to avoid the performance costs of e.g.
>>>> not being able to persist.
>>>>
>>>> 6) General tools for implementing ML algorithms
>>>> Besides implementing algorithms one by one, we could provide more general
>>>> tools for making it easier to implement algorithms, e.g. a parameter
>>>> server [8,9]. Theo also mentioned in another thread that TensorFlow has a
>>>> similar model to Flink streaming; we could look into that too. I think
>>>> that when deploying a production ML system, much more configuration and
>>>> tweaking is often needed than e.g. Spark MLlib allows. Why not allow that?
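>>>>
>>>> As a rough illustration of what I mean (toy types, just a sketch of the
>>>> idea from [8,9], not a concrete design): a "parameter server shard" could
>>>> simply be a stateful operator on a keyed stream of gradients, so the
>>>> workers and the "server" live in one streaming job.
>>>>
>>>>   import org.apache.flink.api.common.functions.RichFlatMapFunction
>>>>   import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
>>>>   import org.apache.flink.configuration.Configuration
>>>>   import org.apache.flink.streaming.api.scala._
>>>>   import org.apache.flink.util.Collector
>>>>
>>>>   case class Gradient(paramId: Int, value: Double)
>>>>   case class ParamUpdate(paramId: Int, value: Double)
>>>>
>>>>   // One logical parameter per key; updates are applied per key without
>>>>   // any global barrier, so workers can proceed somewhat asynchronously.
>>>>   class ParameterShard(learningRate: Double)
>>>>       extends RichFlatMapFunction[Gradient, ParamUpdate] {
>>>>     @transient private var param: ValueState[java.lang.Double] = _
>>>>
>>>>     override def open(conf: Configuration): Unit =
>>>>       param = getRuntimeContext.getState(
>>>>         new ValueStateDescriptor("param", classOf[java.lang.Double]))
>>>>
>>>>     override def flatMap(g: Gradient, out: Collector[ParamUpdate]): Unit = {
>>>>       val current =
>>>>         if (param.value() == null) 0.0 else param.value().doubleValue()
>>>>       val updated = current - learningRate * g.value
>>>>       param.update(updated)        // keyed state holds the parameter value
>>>>       out.collect(ParamUpdate(g.paramId, updated))
>>>>     }
>>>>   }
>>>>
>>>>   // Usage: gradients.keyBy(_.paramId).flatMap(new ParameterShard(0.01))
>>>>
>>>> The updated parameters could then be broadcast back to the workers, and
>>>> checkpointing of the keyed state would give fault tolerance for the model.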
>>>>
>>>> 7) Showcasing
>>>> Showcasing this could be easier. We could say that we're doing batch ML
>>>> with a streaming API, which is interesting in its own right. IMHO this
>>>> integration is also a more approachable way towards end-to-end ML.
>>>>
>>>>
>>>> Thanks for reading so far :)
>>>>
>>>> [1] https://github.com/apache/flink/pull/2819
>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
>>>> [6] https://github.com/apache/flink/pull/1668
>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
>>>>
>>>> Cheers,
>>>> Gabor
>>>>
>>>>

