flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gábor Hermann <m...@gaborhermann.com>
Subject Re: [DISCUSS] Flink ML roadmap
Date Thu, 23 Feb 2017 15:41:28 GMT
Okay, let's just aim for around the end of next week, but we can take 
more time to discuss if there's still a lot of ongoing activity. Keep 
the topic hot!

Thanks all for the enthusiasm :)


On 2017-02-23 16:17, Stavros Kontopoulos wrote:
> @Gabor 3rd March is ok for me. But maybe giving a bit more time to it like
> a week may suit more people.
> What do you think all?
> I will contribute to the doc.
>
> +100 for having a co-ordinator + commiter.
>
> Thank you all for joining the discussion.
>
> Cheers,
> Stavros
>
> On Thu, Feb 23, 2017 at 4:48 PM, Gábor Hermann <mail@gaborhermann.com>
> wrote:
>
>> Okay, I've created a skeleton of the design doc for choosing a direction:
>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
>> 49h3Ud06MIRhahtJ6dw/edit?usp=sharing
>>
>> Much of the pros/cons have already been discussed here, so I'll try to put
>> there all the arguments mentioned in this thread. Feel free to put there
>> more :)
>>
>> @Stavros: I agree we should take action fast. What about collecting our
>> thoughts in the doc by around Tuesday next week (28. February)? Then decide
>> on the direction and design a roadmap by around Friday (3. March)? Is that
>> feasible, or should it take more time?
>>
>> I think it will be necessary to have a shepherd, or even better a
>> committer, to be involved in at least reviewing and accepting the roadmap.
>> It would be best, if a committer coordinated all this.
>> @Theodore: Would you like to do the coordination?
>>
>> Regarding the use-cases: I've seen some abstracts of talks at SF Flink
>> Forward [1] that seem promising. There are companies already using Flink
>> for ML [2,3,4,5].
>>
>> [1] http://sf.flink-forward.org/program/sessions/
>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-str
>> eaming-vs-micro-batch-for-online-learning/
>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-le
>> arning-on-flink/
>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learn
>> ing-scenarios-with-flink/
>>
>> Cheers,
>> Gabor
>>
>>
>>
>> On 2017-02-23 15:19, Katherin Eri wrote:
>>
>>> I have asked already some teams for useful cases, but all of them need
>>> time
>>> to think.
>>> During analysis something will finally arise.
>>> May be we can ask partners of Flink  for cases? Data Artisans got results
>>> of customers survey: [1], ML better support is wanted, so we could ask
>>> what
>>> exactly is necessary.
>>>
>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
>>>
>>> 23 февр. 2017 г. 4:32 PM пользователь "Stavros Kontopoulos"
<
>>> st.kontopoulos@gmail.com> написал:
>>>
>>> +100 for a design doc.
>>>> Could we also set a roadmap after some time-boxed investigation captured
>>>> in
>>>> that document? We need action.
>>>>
>>>> Looking forward to work on this (whatever that might be) ;) Also are
>>>> there
>>>> any data supporting one direction or the other from a customer
>>>> perspective?
>>>> It would help to make more informed decisions.
>>>>
>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <katherinmail@gmail.com>
>>>> wrote:
>>>>
>>>> Yes, ok.
>>>>> let's start some design document, and write down there already mentioned
>>>>> ideas about: parameter server, about clipper and others. Would be nice
>>>>> if
>>>>> we will also map this approaches to cases.
>>>>> Will work on it collaboratively on each topic, may be finally we will
>>>>>
>>>> form
>>>>
>>>>> some picture, that could be agreed with committers.
>>>>> @Gabor, could you please start such shared doc, as you have already
>>>>>
>>>> several
>>>>
>>>>> ideas proposed?
>>>>>
>>>>> чт, 23 февр. 2017, 15:06 Gábor Hermann <mail@gaborhermann.com>:
>>>>>
>>>>> I agree, that it's better to go in one direction first, but I think
>>>>>> online and offline with streaming API can go somewhat parallel later.
>>>>>>
>>>>> We
>>>>> could set a short-term goal, concentrate initially on one direction,
>>>>> and
>>>>> showcase that direction (e.g. in a blogpost). But first, we should list
>>>>>> the pros/cons in a design doc as a minimum. Then make a decision
what
>>>>>> direction to go. Would that be feasible?
>>>>>>
>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
>>>>>>
>>>>>> I'm not sure that this is feasible, doing all at the same time could
>>>>>> mean
>>>>>> doing nothing((((
>>>>>>> I'm just afraid, that words: we will work on streaming not on
>>>>>>>
>>>>>> batching,
>>>>> we
>>>>>>> have no commiter's time for this, mean that yes, we started work
on
>>>>>>> FLINK-1730, but nobody will commit this work in the end, as it
>>>>>>>
>>>>>> already
>>>>> was
>>>>>>> with this ticket.
>>>>>>>
>>>>>>> 23 февр. 2017 г. 14:26 пользователь "Gábor
Hermann" <
>>>>>>>
>>>>>> mail@gaborhermann.com>
>>>>>>
>>>>>>> написал:
>>>>>>>
>>>>>>> @Theodore: Great to hear you think the "batch on streaming" approach
>>>>>>> is
>>>>>> possible! Of course, we need to pay attention all the pitfalls
>>>>>>> there,
>>>>> if we
>>>>>>> go that way.
>>>>>>>> +1 for a design doc!
>>>>>>>>
>>>>>>>> I would add that it's possible to make efforts in all the
three
>>>>>>>>
>>>>>>> directions
>>>>>>> (i.e. batch, online, batch on streaming) at the same time. Although,
>>>>>>> it
>>>>>> might be worth to concentrate on one. E.g. it would not be so useful
>>>>>>> to
>>>>>> have the same batch algorithms with both the batch API and streaming
>>>>>>> API.
>>>>>>> We can decide later.
>>>>>>>> The design doc could be partitioned to these 3 directions,
and we
>>>>>>>>
>>>>>>> can
>>>>> collect there the pros/cons too. What do you think?
>>>>>>>> Cheers,
>>>>>>>> Gabor
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> @Gabor, we have discussed the idea of using the streaming
API to
>>>>>>>>>
>>>>>>>> write
>>>>>> all
>>>>>>
>>>>>>> of our ML algorithms with a couple of people offline,
>>>>>>>>> and I think it might be possible and is generally worth
a shot. The
>>>>>>>>> approach we would take would be close to Vowpal Wabbit,
not exactly
>>>>>>>>> "online", but rather "fast-batch".
>>>>>>>>>
>>>>>>>>> There will be problems popping up again, even for very
simple algos
>>>>>>>>>
>>>>>>>> like
>>>>>>> on
>>>>>>>>> line linear regression with SGD [1], but hopefully fixing
those
>>>>>>>>>
>>>>>>>> will
>>>>> be
>>>>>
>>>>>> more aligned with the priorities of the community.
>>>>>>>>> @Katherin, my understanding is that given the limited
resources,
>>>>>>>>>
>>>>>>>> there
>>>>>> is
>>>>>>
>>>>>>> no development effort focused on batch processing right now.
>>>>>>>>> So to summarize, it seems like there are people willing
to work on
>>>>>>>>>
>>>>>>>> ML
>>>>> on
>>>>>>> Flink, but nobody is sure how to do it.
>>>>>>>>> There are many directions we could take (batch, online,
batch on
>>>>>>>>> streaming), each with its own merits and downsides.
>>>>>>>>>
>>>>>>>>> If you want we can start a design doc and move the conversation
>>>>>>>>>
>>>>>>>> there,
>>>>>> come
>>>>>>>>> up with a roadmap and start implementing.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Theodore
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.
>>>>>>>>> nabble.com/Understanding-connected-streams-use-without-times
>>>>>>>>> tamps-td10241.html
>>>>>>>>>
>>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
>>>>>>>>>
>>>>>>>> mail@gaborhermann.com
>>>>>> wrote:
>>>>>>>>> It's great to see so much activity in this discussion
:)
>>>>>>>>>
>>>>>>>>>> I'll try to add my thoughts.
>>>>>>>>>>
>>>>>>>>>> I think building a developer community (Till's 2.
point) can be
>>>>>>>>>>
>>>>>>>>> slightly
>>>>>>> separated from what features we should aim for (1. point) and
>>>>>>>>> showcasing
>>>>>>> (3. point). Thanks Till for bringing up the ideas for
>>>>>>>>> restructuring,
>>>>> I'm
>>>>>>> sure we'll find a way to make the development process more
>>>>>>>>> dynamic.
>>>>> I'll
>>>>>>> try to address the rest here.
>>>>>>>>>> It's hard to choose directions between streaming
and batch ML. As
>>>>>>>>>>
>>>>>>>>> Theo
>>>>>> has
>>>>>>>>>> indicated, not much online ML is used in production,
but Flink
>>>>>>>>>> concentrates
>>>>>>>>>> on streaming, so online ML would be a better fit
for Flink.
>>>>>>>>>>
>>>>>>>>> However,
>>>>> as
>>>>>>> most of you argued, there's definite need for batch ML. But batch
>>>>>>>>> ML
>>>>> seems
>>>>>>>>>> hard to achieve because there are blocking issues
with persisting,
>>>>>>>>>> iteration paths etc. So it's no good either way.
>>>>>>>>>>
>>>>>>>>>> I propose a seemingly crazy solution: what if we
developed batch
>>>>>>>>>> algorithms also with the streaming API? The batch
API would
>>>>>>>>>>
>>>>>>>>> clearly
>>>>> seem
>>>>>>> more suitable for ML algorithms, but there a lot of benefits
of
>>>>>>>>> this
>>>>> approach too, so it's clearly worth considering. Flink also has
>>>>>>>>> the
>>>>> high
>>>>>>> level vision of "streaming for everything" that would clearly
fit
>>>>>>>>> this
>>>>>> case. What do you all think about this? Do you think this solution
>>>>>>>>> would
>>>>>>> be
>>>>>>>>>> feasible? I would be happy to make a more elaborate
proposal, but
>>>>>>>>>>
>>>>>>>>> I
>>>>> push
>>>>>>> my
>>>>>>>>>> main ideas here:
>>>>>>>>>>
>>>>>>>>>> 1) Simplifying by using one system
>>>>>>>>>> It could simplify the work of both the users and
the developers.
>>>>>>>>>>
>>>>>>>>> One
>>>>> could
>>>>>>>>>> execute training once, or could execute it periodically
e.g. by
>>>>>>>>>>
>>>>>>>>> using
>>>>>> windows. Low-latency serving and training could be done in the
>>>>>>>>> same
>>>>> system.
>>>>>>>>>> We could implement incremental algorithms, without
any side inputs
>>>>>>>>>>
>>>>>>>>> for
>>>>>> combining online learning (or predictions) with batch learning. Of
>>>>>>>>>> course,
>>>>>>>>>> all the logic describing these must be somehow implemented
(e.g.
>>>>>>>>>> synchronizing predictions with training), but it
should be easier
>>>>>>>>>>
>>>>>>>>> to
>>>>> do
>>>>>>> so
>>>>>>>>>> in one system, than by combining e.g. the batch and
streaming API.
>>>>>>>>>>
>>>>>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>>>>>> Despite these benefits, it could seem harder to implement
batch ML
>>>>>>>>>>
>>>>>>>>> with
>>>>>>> the streaming API, but in my opinion it's not. There are more
>>>>>>>>> flexible,
>>>>>>> lower-level optimization potentials with the streaming API. Most
>>>>>>>>>> distributed ML algorithms use a lower-level model
than the batch
>>>>>>>>>>
>>>>>>>>> API
>>>>> anyway, so sometimes it feels like forcing the algorithm logic
>>>>>>>>> into
>>>>> the
>>>>>>> training API and tweaking it. Although we could not use the batch
>>>>>>>>>> primitives like join, we would have the E.g. in my
experience with
>>>>>>>>>> implementing a distributed matrix factorization algorithm
[1], I
>>>>>>>>>>
>>>>>>>>> couldn't
>>>>>>> do a simple optimization because of the limitations of the
>>>>>>>>> iteration
>>>>> API
>>>>>>> [2]. Even if we pushed all the development effort to make the
>>>>>>>>> batch
>>>>> API
>>>>>>> more suitable for ML there would be things we couldn't do. E.g.
>>>>>>>>> there
>>>>>> are
>>>>>>
>>>>>>> approaches for updating a model iteratively without locks [3,4]
>>>>>>>>> (i.e.
>>>>>> somewhat asynchronously), and I don't see a clear way to implement
>>>>>>>>> such
>>>>>>> algorithms with the batch API.
>>>>>>>>>> 3) Streaming community (users and devs) benefit
>>>>>>>>>> The Flink streaming community in general would also
benefit from
>>>>>>>>>>
>>>>>>>>> this
>>>>>> direction. There are many features needed in the streaming API for
>>>>>>>>> ML
>>>>>> to
>>>>>>
>>>>>>> work, but this is also true for the batch API. One really
>>>>>>>>> important
>>>>> is
>>>>>
>>>>>> the
>>>>>>>>>> loops API (a.k.a. iterative DataStreams) [5]. There
has been a lot
>>>>>>>>>>
>>>>>>>>> of
>>>>>> effort (mostly from Paris) for making it mature enough [6]. Kate
>>>>>>>>>> mentioned
>>>>>>>>>> using GPUs, and I'm sure they have uses in streaming
generally
>>>>>>>>>>
>>>>>>>>> [7].
>>>>> Thus,
>>>>>>> by improving the streaming API to allow ML algorithms, the
>>>>>>>>> streaming
>>>>> API
>>>>>>> benefit too (which is important as they have a lot more production
>>>>>>>>> users
>>>>>>> than the batch API).
>>>>>>>>>> 4) Performance can be at least as good
>>>>>>>>>> I believe the same performance could be achieved
with the
>>>>>>>>>>
>>>>>>>>> streaming
>>>>> API
>>>>>>> as
>>>>>>>>>> with the batch API. Streaming API is much closer
to the runtime
>>>>>>>>>>
>>>>>>>>> than
>>>>> the
>>>>>>> batch API. For corner-cases, with runtime-layer optimizations
of
>>>>>>>>> batch
>>>>>> API,
>>>>>>>>>> we could find a way to do the same (or similar) optimization
for
>>>>>>>>>>
>>>>>>>>> the
>>>>> streaming API (see my previous point). Such case could be using
>>>>>>>>> managed
>>>>>>> memory (and spilling to disk). There are also benefits by default,
>>>>>>>>> e.g.
>>>>>>> we
>>>>>>>>>> would have a finer grained fault tolerance with the
streaming API.
>>>>>>>>>>
>>>>>>>>>> 5) We could keep batch ML API
>>>>>>>>>> For the shorter term, we should not throw away all
the algorithms
>>>>>>>>>> implemented with the batch API. By pushing forward
the development
>>>>>>>>>>
>>>>>>>>> with
>>>>>>> side inputs we could make them usable with streaming API. Then,
if
>>>>>>>>> the
>>>>>> library gains some popularity, we could replace the algorithms in
>>>>>>>>> the
>>>>>> batch
>>>>>>>>>> API with streaming ones, to avoid the performance
costs of e.g.
>>>>>>>>>>
>>>>>>>>> not
>>>>> being
>>>>>>> able to persist.
>>>>>>>>>> 6) General tools for implementing ML algorithms
>>>>>>>>>> Besides implementing algorithms one by one, we could
give more
>>>>>>>>>>
>>>>>>>>> general
>>>>>> tools for making it easier to implement algorithms. E.g. parameter
>>>>>>>>> server
>>>>>>> [8,9]. Theo also mentioned in another thread that TensorFlow
has a
>>>>>>>>>> similar
>>>>>>>>>> model to Flink streaming, we could look into that
too. I think
>>>>>>>>>>
>>>>>>>>> often
>>>>> when
>>>>>>> deploying a production ML system, much more configuration and
>>>>>>>>> tweaking
>>>>>> should be done than e.g. Spark MLlib allows. Why not allow that?
>>>>>>>>>> 7) Showcasing
>>>>>>>>>> Showcasing this could be easier. We could say that
we're doing
>>>>>>>>>>
>>>>>>>>> batch
>>>>> ML
>>>>>>> with a streaming API. That's interesting in its own. IMHO this
>>>>>>>>>> integration
>>>>>>>>>> is also a more approachable way towards end-to-end
ML.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks for reading so far :)
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
>>>>>>>>>> 13-final77.pdf
>>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
>>>>>>>>>> Scoped+Loops+and+Job+Termination
>>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.
>>>>>>>>>>
>>>>>>>>> pdf
>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
>>>>>>>>>> com/Using-QueryableState-inside-Flink-jobs-and-
>>>>>>>>>> Parameter-Server-implementation-td15880.html
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Gabor
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>> *Yours faithfully, *
>>>>>
>>>>> *Kate Eri.*
>>>>>
>>>>>


Mime
View raw message