flink-dev mailing list archives

From Gábor Hermann <m...@gaborhermann.com>
Subject Re: [DISCUSS] Flink ML roadmap
Date Thu, 23 Feb 2017 16:21:28 GMT
@Theodore, thanks for taking lead in the coordination :)

Let's see what we can do, and then decide what should start out as an
independent project and what should stay strictly inside Flink.
I agree that something experimental like batch ML on streaming would
probably benefit more from an independent repo first.

On 2017-02-23 16:56, Theodore Vasiloudis wrote:

> Sure, having a deadline for March 3rd is fine. I can act as coordinator,
> trying to guide the discussion to concrete results.
>
> For committers, it's up to their discretion and time whether they want to
> participate. I don't think it's necessary to have one, but it would be most
> welcome.
>
> @Katherin I would suggest you start a topic on the list about FLINK-1730.
> If it takes a lot of development effort from your side, it's best to at
> least try to gauge the community's interest, and whether there will be
> motivation to merge the changes.
>
> Maybe at the end of this we'll have a FLIP we can submit; that's probably the
> way forward if we want to keep this effort within the project. For a new,
> highly experimental project like batch ML on streaming I would actually
> favor developing in an independent repo, which can later be merged into
> main if there is interest.
>
> Regards.
> Theodore
>
> On Thu, Feb 23, 2017 at 4:41 PM, Gábor Hermann <mail@gaborhermann.com> wrote:
>
>> Okay, let's just aim for around the end of next week, but we can take more
>> time to discuss if there's still a lot of ongoing activity. Keep the topic
>> hot!
>>
>> Thanks all for the enthusiasm :)
>>
>>
>>
>> On 2017-02-23 16:17, Stavros Kontopoulos wrote:
>>
>>> @Gabor 3rd March is ok for me. But maybe giving it a bit more time, like
>>> a week, may suit more people.
>>> What do you all think?
>>> I will contribute to the doc.
>>>
>>> +100 for having a coordinator + committer.
>>>
>>> Thank you all for joining the discussion.
>>>
>>> Cheers,
>>> Stavros
>>>
>>> On Thu, Feb 23, 2017 at 4:48 PM, Gábor Hermann <mail@gaborhermann.com> wrote:
>>>
>>>> Okay, I've created a skeleton of the design doc for choosing a direction:
>>>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing
>>>>
>>>> Many of the pros/cons have already been discussed here, so I'll try to
>>>> collect all the arguments mentioned in this thread there. Feel free to add
>>>> more :)
>>>>
>>>> @Stavros: I agree we should take action fast. What about collecting our
>>>> thoughts in the doc by around Tuesday next week (28 February)? Then
>>>> deciding on the direction and designing a roadmap by around Friday
>>>> (3 March)? Is that feasible, or should it take more time?
>>>>
>>>> I think it will be necessary to have a shepherd, or even better a
>>>> committer, involved in at least reviewing and accepting the roadmap.
>>>> It would be best if a committer coordinated all this.
>>>> @Theodore: Would you like to do the coordination?
>>>>
>>>> Regarding the use cases: I've seen some abstracts of talks at SF Flink
>>>> Forward [1] that seem promising. There are companies already using Flink
>>>> for ML [2,3,4,5].
>>>>
>>>> [1] http://sf.flink-forward.org/program/sessions/
>>>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
>>>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
>>>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-learning-on-flink/
>>>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learning-scenarios-with-flink/
>>>>
>>>> Cheers,
>>>> Gabor
>>>>
>>>> On 2017-02-23 15:19, Katherin Eri wrote:
>>>>>
>>>>> I have already asked some teams for useful cases, but all of them need
>>>>> time to think. Something will surely come up during the analysis.
>>>>> Maybe we can ask the partners of Flink for cases? Data Artisans got the
>>>>> results of a customer survey [1]: better ML support is wanted, so we
>>>>> could ask what exactly is necessary.
>>>>>
>>>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
>>>>>
>>>>> On 23 Feb 2017 at 4:32 PM, "Stavros Kontopoulos" <st.kontopoulos@gmail.com> wrote:
>>>>>
>>>>>> +100 for a design doc.
>>>>>>
>>>>>> Could we also set a roadmap after some time-boxed investigation captured
>>>>>> in that document? We need action.
>>>>>>
>>>>>> Looking forward to working on this (whatever that might be) ;) Also, are
>>>>>> there any data supporting one direction or the other from a customer
>>>>>> perspective? It would help to make more informed decisions.
>>>>>>
>>>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <katherinmail@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, ok.
>>>>>>>
>>>>>>> Let's start some design document and write down the already mentioned
>>>>>>> ideas there: about the parameter server, about Clipper, and others. It
>>>>>>> would be nice if we also mapped these approaches to use cases.
>>>>>>> We will work on it collaboratively on each topic; maybe we will finally
>>>>>>> form some picture that the committers can agree with.
>>>>>>> @Gabor, could you please start such a shared doc, as you have already
>>>>>>> proposed several ideas?
>>>>>>>
>>>>>>> Thu, 23 Feb 2017, 15:06 Gábor Hermann <mail@gaborhermann.com>:
>>>>>>>
>>>>>>>> I agree that it's better to go in one direction first, but I think
>>>>>>>> online and offline with the streaming API can go somewhat in parallel
>>>>>>>> later. We could set a short-term goal, concentrate initially on one
>>>>>>>> direction, and showcase that direction (e.g. in a blog post). But first,
>>>>>>>> we should at minimum list the pros/cons in a design doc, then decide
>>>>>>>> which direction to go. Would that be feasible?
>>>>>>>>
>>>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
>>>>>>>>>
>>>>>>>>> I'm not sure that this is feasible; doing everything at the same time
>>>>>>>>> could mean doing nothing ((((
>>>>>>>>>
>>>>>>>>> I'm just afraid that the words "we will work on streaming, not on
>>>>>>>>> batching, we have no committer's time for this" mean that yes, we
>>>>>>>>> started work on FLINK-1730, but nobody will commit this work in the
>>>>>>>>> end, as has already happened with this ticket.
>>>>>>>>>
>>>>>>>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <mail@gaborhermann.com> wrote:
>>>>>>>>>> @Theodore: Great to hear you think the "batch on streaming" approach
>>>>>>>>>> is possible! Of course, we need to pay attention to all the pitfalls
>>>>>>>>>> there if we go that way.
>>>>>>>>>>
>>>>>>>>>> +1 for a design doc!
>>>>>>>>>>
>>>>>>>>>> I would add that it's possible to make efforts in all three directions
>>>>>>>>>> (i.e. batch, online, batch on streaming) at the same time, although it
>>>>>>>>>> might be worth concentrating on one. E.g. it would not be so useful to
>>>>>>>>>> have the same batch algorithms with both the batch API and the
>>>>>>>>>> streaming API. We can decide later.
>>>>>>>>>>
>>>>>>>>>> The design doc could be partitioned into these 3 directions, and we
>>>>>>>>>> can collect the pros/cons there too. What do you think?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Gabor
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>>
>>>>>>>>>>> @Gabor, we have discussed the idea of using the streaming API to
>>>>>>>>>>> write all of our ML algorithms with a couple of people offline, and I
>>>>>>>>>>> think it might be possible and is generally worth a shot. The
>>>>>>>>>>> approach we would take would be close to Vowpal Wabbit: not exactly
>>>>>>>>>>> "online", but rather "fast-batch".
>>>>>>>>>>>
>>>>>>>>>>> There will be problems popping up again, even for very simple algos
>>>>>>>>>>> like online linear regression with SGD [1], but hopefully fixing
>>>>>>>>>>> those will be more aligned with the priorities of the community.
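For concreteness, a minimal sketch of the kind of algorithm meant here: online
linear regression with one SGD step per labeled example, trained and served from
the same operator over connected streams. It uses the Flink Scala DataStream API;
the record types and names are hypothetical, and the weight vector lives in a
plain operator field, so it is neither keyed nor fault-tolerant:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector

// Hypothetical record types for the sketch.
case class LabeledPoint(features: Array[Double], label: Double)
case class Prediction(features: Array[Double], score: Double)

// One operator learns and serves: stream 1 updates the weights, stream 2 is scored.
class OnlineSgdRegressor(dim: Int, learningRate: Double)
  extends CoFlatMapFunction[LabeledPoint, Array[Double], Prediction] {

  private val weights: Array[Double] = Array.fill(dim)(0.0) // not checkpointed

  private def dot(x: Array[Double]): Double =
    weights.zip(x).map { case (w, xi) => w * xi }.sum

  override def flatMap1(p: LabeledPoint, out: Collector[Prediction]): Unit = {
    // One SGD step on the squared loss per labeled example.
    val err = dot(p.features) - p.label
    var i = 0
    while (i < dim) { weights(i) -= learningRate * err * p.features(i); i += 1 }
  }

  override def flatMap2(x: Array[Double], out: Collector[Prediction]): Unit =
    out.collect(Prediction(x, dot(x)))
}

object OnlineSgdSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Toy sources; a real job would read from Kafka or similar.
    val training = env.fromElements(LabeledPoint(Array(1.0, 2.0), 5.0))
    val queries  = env.fromElements(Array(1.0, 1.0))

    training.connect(queries)
      .flatMap(new OnlineSgdRegressor(dim = 2, learningRate = 0.01))
      .print()

    env.execute("online-sgd-sketch")
  }
}

The training stream drives flatMap1 (model updates) and the query stream drives
flatMap2 (predictions); synchronizing the two streams (e.g. timestamps) is exactly
the kind of problem discussed in [1].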
>>>>>>>>>>> @Katherin, my understanding is that given the limited resources,
>>>>>>>>>>> there is no development effort focused on batch processing right now.
>>>>>>>>>>>
>>>>>>>>>>> So to summarize, it seems like there are people willing to work on ML
>>>>>>>>>>> on Flink, but nobody is sure how to do it. There are many directions
>>>>>>>>>>> we could take (batch, online, batch on streaming), each with its own
>>>>>>>>>>> merits and downsides.
>>>>>>>>>>>
>>>>>>>>>>> If you want, we can start a design doc, move the conversation there,
>>>>>>>>>>> come up with a roadmap, and start implementing.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Theodore
>>>>>>>>>>>
>>>>>>>>>>> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <mail@gaborhermann.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> It's great to see so much activity in this discussion :)
>>>>>>>>>>>> I'll try to add my thoughts.
>>>>>>>>>>>>
>>>>>>>>>>>> I think building a developer community (Till's 2nd point) can be
>>>>>>>>>>>> slightly separated from what features we should aim for (1st point)
>>>>>>>>>>>> and showcasing (3rd point). Thanks Till for bringing up the ideas for
>>>>>>>>>>>> restructuring; I'm sure we'll find a way to make the development
>>>>>>>>>>>> process more dynamic. I'll try to address the rest here.
>>>>>>>>>>>>
>>>>>>>>>>>> It's hard to choose directions between streaming and batch ML. As
>>>>>>>>>>>> Theo has indicated, not much online ML is used in production, but
>>>>>>>>>>>> Flink concentrates on streaming, so online ML would be a better fit
>>>>>>>>>>>> for Flink. However, as most of you argued, there's a definite need
>>>>>>>>>>>> for batch ML. But batch ML seems hard to achieve because there are
>>>>>>>>>>>> blocking issues with persisting, iteration paths etc. So it's no good
>>>>>>>>>>>> either way.
>>>>>>>>>>>>
>>>>>>>>>>>> I propose a seemingly crazy solution: what if we developed batch
>>>>>>>>>>>> algorithms with the streaming API too? The batch API would clearly
>>>>>>>>>>>> seem more suitable for ML algorithms, but there are a lot of benefits
>>>>>>>>>>>> to this approach too, so it's clearly worth considering. Flink also
>>>>>>>>>>>> has the high-level vision of "streaming for everything", which would
>>>>>>>>>>>> clearly fit this case. What do you all think about this? Do you think
>>>>>>>>>>>> this solution would be feasible? I would be happy to make a more
>>>>>>>>>>>> elaborate proposal, but I'll push my main ideas here:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Simplifying by using one system
>>>>>>>>>>>> It could simplify the work of both the users and the developers. One
>>>>>>>>>>>> could execute training once, or execute it periodically, e.g. by
>>>>>>>>>>>> using windows. Low-latency serving and training could be done in the
>>>>>>>>>>>> same system. We could implement incremental algorithms, without any
>>>>>>>>>>>> side inputs, for combining online learning (or predictions) with
>>>>>>>>>>>> batch learning. Of course, all the logic describing these must be
>>>>>>>>>>>> somehow implemented (e.g. synchronizing predictions with training),
>>>>>>>>>>>> but it should be easier to do so in one system than by combining
>>>>>>>>>>>> e.g. the batch and streaming API.
>>>>>>>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>>>>>>>> Despite these benefits, it could seem harder to implement batch ML
>>>>>>>>>>>> with the streaming API, but in my opinion it's not. There is more
>>>>>>>>>>>> flexible, lower-level optimization potential with the streaming API.
>>>>>>>>>>>> Most distributed ML algorithms use a lower-level model than the batch
>>>>>>>>>>>> API anyway, so sometimes it feels like forcing the algorithm logic
>>>>>>>>>>>> into the training API and tweaking it. Although we could not use the
>>>>>>>>>>>> batch primitives like join, we would have that lower-level
>>>>>>>>>>>> flexibility. E.g. in my experience with implementing a distributed
>>>>>>>>>>>> matrix factorization algorithm [1], I couldn't do a simple
>>>>>>>>>>>> optimization because of the limitations of the iteration API [2].
>>>>>>>>>>>> Even if we pushed all the development effort into making the batch
>>>>>>>>>>>> API more suitable for ML, there would be things we couldn't do. E.g.
>>>>>>>>>>>> there are approaches for updating a model iteratively without locks
>>>>>>>>>>>> [3,4] (i.e. somewhat asynchronously), and I don't see a clear way to
>>>>>>>>>>>> implement such algorithms with the batch API.
>>>>>>>>>>>>
>>>>>>>>>>>> 3) Streaming community (users and devs) benefit
>>>>>>>>>>>> The Flink streaming community in general would also benefit from this
>>>>>>>>>>>> direction. There are many features needed in the streaming API for ML
>>>>>>>>>>>> to work, but this is also true for the batch API. One really
>>>>>>>>>>>> important one is the loops API (a.k.a. iterative DataStreams) [5].
>>>>>>>>>>>> There has been a lot of effort (mostly from Paris) to make it mature
>>>>>>>>>>>> enough [6]. Kate mentioned using GPUs, and I'm sure they have uses in
>>>>>>>>>>>> streaming generally [7]. Thus, by improving the streaming API to
>>>>>>>>>>>> allow ML algorithms, the streaming API benefits too (which is
>>>>>>>>>>>> important, as it has a lot more production users than the batch API).
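The loops API mentioned above refers to iterative DataStreams (DataStream.iterate),
where the step function returns one stream that is fed back into the loop and one
that leaves it; [5] and [6] are the efforts to make this mature enough to rely on.
A tiny control-flow sketch (not an ML algorithm), with parallelism pinned to 1 so
the feedback edge matches the loop head:

import org.apache.flink.streaming.api.scala._

object IterationSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // keep the feedback edge's parallelism equal to the loop head's

    val input = env.fromElements(16L, 5L, 9L)

    // Halve each element; values >= 2 are fed back into the loop, the rest leave it.
    val result = input.iterate((loop: DataStream[Long]) => {
      val halved = loop.map(_ / 2)
      (halved.filter(_ >= 2), halved.filter(_ < 2)) // (feedback, output)
    }, 5000) // stop waiting for feedback after 5s of silence so the demo can finish

    result.print()
    env.execute("iteration-sketch")
  }
}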
>>>>>>>>>>>> 4) Performance can be at least as good
>>>>>>>>>>>> I believe the same performance could be achieved with the streaming
>>>>>>>>>>>> API as with the batch API. The streaming API is much closer to the
>>>>>>>>>>>> runtime than the batch API. For corner cases covered by runtime-layer
>>>>>>>>>>>> optimizations of the batch API, we could find a way to do the same
>>>>>>>>>>>> (or a similar) optimization for the streaming API (see my previous
>>>>>>>>>>>> point). Such a case could be using managed memory (and spilling to
>>>>>>>>>>>> disk). There are also benefits by default, e.g. we would have
>>>>>>>>>>>> finer-grained fault tolerance with the streaming API.
>>>>>>>>>>>>
>>>>>>>>>>>> 5) We could keep the batch ML API
>>>>>>>>>>>> For the shorter term, we should not throw away all the algorithms
>>>>>>>>>>>> implemented with the batch API. By pushing forward the development of
>>>>>>>>>>>> side inputs we could make them usable with the streaming API. Then,
>>>>>>>>>>>> if the library gains some popularity, we could replace the batch API
>>>>>>>>>>>> algorithms with streaming ones, to avoid the performance costs of
>>>>>>>>>>>> e.g. not being able to persist.
>>>>>>>>>>>>
>>>>>>>>>>>> 6) General tools for implementing ML algorithms
>>>>>>>>>>>> Besides implementing algorithms one by one, we could provide more
>>>>>>>>>>>> general tools that make it easier to implement algorithms, e.g. a
>>>>>>>>>>>> parameter server [8,9]. Theo also mentioned in another thread that
>>>>>>>>>>>> TensorFlow has a model similar to Flink streaming; we could look into
>>>>>>>>>>>> that too. I think that often, when deploying a production ML system,
>>>>>>>>>>>> much more configuration and tweaking should be possible than e.g.
>>>>>>>>>>>> Spark MLlib allows. Why not allow that?
>>>>>>>>>>>> 7) Showcasing
>>>>>>>>>>>> Showcasing this could be easier. We could say that we're doing batch
>>>>>>>>>>>> ML with a streaming API. That's interesting in its own right. IMHO
>>>>>>>>>>>> this integration is also a more approachable way towards end-to-end
>>>>>>>>>>>> ML.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for reading so far :)
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>>>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
>>>>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
>>>>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>>>>>>>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Gabor
>>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Yours faithfully, *
>>>>>>> *Kate Eri.*

