flink-dev mailing list archives

From Philipp Zehnder <philipp.zehn...@gmx.de>
Subject Re: [DISCUSS] Flink ML roadmap
Date Mon, 27 Feb 2017 17:39:34 GMT
Hello all,

I’m new to this mailing list and wanted to introduce myself. My name is Philipp Zehnder,
and I’m a master’s student in Computer Science at the Karlsruhe Institute of Technology in
Germany, currently writing my master’s thesis, whose main goal is to integrate reusable
machine learning components into a stream processing network. One part of my thesis is to
create an API for distributed online machine learning.

I saw that there have been some recent discussions about how to continue the development of
Flink ML [1], and I want to share some of my experiences and maybe get some feedback from
the community on my ideas.

As I am new to open source projects I hope this is the right place for this.

In the beginning, I had a look at existing frameworks such as Apache SAMOA, which is great
and has a lot of useful resources. However, as Flink is currently focusing on streaming,
from my point of view it makes sense to also have a streaming machine learning API as part
of the Flink ecosystem.

I’m currently working on building a prototype for a distributed streaming machine learning
library based on Flink that can be used for online and “classical” offline learning.

The machine learning algorithm takes both labeled and unlabeled data. For a labeled data
point, a prediction is performed first and the label is then used to train the model; for
an unlabeled data point, only a prediction is performed. The main difference between the
online and offline algorithms is that in the offline case the labeled data must be handed
to the model before the unlabeled data, while in the online case it is still possible to
process labeled data at a later point to update the model. The advantage of this approach
is that batch algorithms can be applied to streaming data, and online algorithms can be
supported as well.
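
To make the predict-then-train idea concrete, here is a minimal, hypothetical sketch in plain Python (the class and all names are mine, not an existing Flink or Flink ML API); a real implementation would of course live inside a streaming operator:

```python
# Hypothetical sketch: one model handles labeled and unlabeled points on a
# single stream. For a labeled point it predicts first, then trains
# ("test-then-train"); for an unlabeled point it only predicts.

class OnlineLinearModel:
    """Online linear regression updated with plain SGD."""

    def __init__(self, n_features, lr=0.01):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def on_event(self, x, label=None):
        # Always predict; the prediction happens *before* the label is used,
        # so it can double as an honest evaluation sample.
        y_hat = self.predict(x)
        if label is not None:
            # Labeled point: use it afterwards to update the model.
            err = y_hat - label
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * err
        return y_hat

model = OnlineLinearModel(n_features=1)
# Stream of (x, label) events; label is None for unlabeled points.
stream = [([1.0], 2.0), ([2.0], 4.0), ([3.0], None), ([4.0], 8.0)]
preds = [model.on_event(x, y) for x, y in stream]
```

In the offline case the same loop works, provided all labeled events are fed in before the unlabeled ones.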

One difference from batch learning is the set of transformers used to preprocess the data.
For example, a simple mean subtraction must be implemented with a rolling mean, because we
can’t calculate the mean over all the data; the Flink Streaming API is perfect for that.
It would be useful for users to have an extensible toolbox of such transformers.
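
As a sketch of such a transformer (the names are illustrative, not an existing API), a rolling mean subtraction could look like this in plain Python:

```python
# Minimal sketch of a streaming "mean subtraction" transformer: instead of the
# dataset-wide mean, it maintains a running mean and subtracts the mean seen
# so far from each incoming element.

class RunningMeanSubtraction:
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def transform(self, x):
        # Incremental mean update (numerically stable Welford-style step).
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return x - self.mean

t = RunningMeanSubtraction()
out = [t.transform(x) for x in [2.0, 4.0, 6.0]]
# The running mean is 2.0, then 3.0, then 4.0, so out is [0.0, 1.0, 2.0].
```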

Another difference is the evaluation of the models. In streaming scenarios we don’t have a
single value that determines model quality; instead, this value evolves over time as the
model sees more labeled data.
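
For instance, an evolving accuracy metric (a hypothetical sketch, computed prequentially over the predictions made before each label arrives) could be tracked like this:

```python
# Sketch of an evolving quality metric: the running accuracy over predictions
# that were made before each label was revealed. A real implementation would
# live inside a streaming operator; this just shows the value changing over time.

def prequential_accuracy(events):
    """events: iterable of (prediction, label); yields accuracy after each."""
    correct = 0
    for i, (pred, label) in enumerate(events, start=1):
        correct += (pred == label)
        yield correct / i

history = list(prequential_accuracy([(1, 1), (0, 1), (1, 1), (1, 0)]))
# The metric evolves: 1.0, 0.5, 2/3, 0.5
```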

However, transformation and evaluation again work similarly in both online and offline
learning.

I also liked the discussion in [2], and I think that competition in the batch learning
field is fierce, with a lot of great projects already out there. It is true that most
real-world problems do not require the model to be updated immediately, but there are a lot
of use cases for machine learning on streams, and for those it would be nice to have a
native approach.

A streaming machine learning API would fit Flink very well, and I would be willing to
contribute to the future development of the Flink ML library.



Best regards,

Philipp

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-td16040.html
[2] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2


> On 23 Feb 2017, at 15:48, Gábor Hermann <mail@gaborhermann.com> wrote:
> 
> Okay, I've created a skeleton of the design doc for choosing a direction:
> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing
> 
> Much of the pros/cons have already been discussed here, so I'll try to put there all
> the arguments mentioned in this thread. Feel free to put there more :)
> 
> @Stavros: I agree we should take action fast. What about collecting our thoughts in the
> doc by around Tuesday next week (28 February)? Then decide on the direction and design a
> roadmap by around Friday (3 March)? Is that feasible, or should it take more time?
> 
> I think it will be necessary to have a shepherd, or even better a committer, involved
> in at least reviewing and accepting the roadmap. It would be best if a committer
> coordinated all this.
> @Theodore: Would you like to do the coordination?
> 
> Regarding the use-cases: I've seen some abstracts of talks at SF Flink Forward [1] that
> seem promising. There are companies already using Flink for ML [2,3,4,5].
> 
> [1] http://sf.flink-forward.org/program/sessions/
> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-learning-on-flink/
> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learning-scenarios-with-flink/
> 
> Cheers,
> Gabor
> 
> 
> On 2017-02-23 15:19, Katherin Eri wrote:
>> I have already asked some teams for useful cases, but all of them need time
>> to think. Something will eventually arise during the analysis.
>> Maybe we can ask partners of Flink for cases? data Artisans got the results
>> of a customer survey [1]: better ML support is wanted, so we could ask what
>> exactly is necessary.
>> 
>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
>> 
>> On 23 Feb 2017, 4:32 PM, "Stavros Kontopoulos" <st.kontopoulos@gmail.com> wrote:
>> 
>>> +100 for a design doc.
>>> 
>>> Could we also set a roadmap after some time-boxed investigation captured in
>>> that document? We need action.
>>> 
>>> Looking forward to working on this (whatever that might be) ;) Also, is there
>>> any data supporting one direction or the other from a customer perspective?
>>> It would help us make more informed decisions.
>>> 
>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <katherinmail@gmail.com>
>>> wrote:
>>> 
>>>> Yes, ok.
>>>> Let's start a design document and write down the ideas already mentioned
>>>> there: parameter server, Clipper, and others. It would be nice if we also
>>>> mapped these approaches to use cases.
>>>> We will work on each topic collaboratively; maybe we will finally form
>>>> some picture that can be agreed with the committers.
>>>> @Gabor, could you please start such a shared doc, as you have already
>>>> proposed several ideas?
>>>> 
>>>> On Thu, 23 Feb 2017, 15:06 Gábor Hermann <mail@gaborhermann.com> wrote:
>>>> 
>>>>> I agree that it's better to go in one direction first, but I think
>>>>> online and offline with the streaming API can go somewhat in parallel later.
>>>>> We could set a short-term goal, concentrate initially on one direction, and
>>>>> showcase that direction (e.g. in a blog post). But first, we should list
>>>>> the pros/cons in a design doc as a minimum, then make a decision on what
>>>>> direction to go. Would that be feasible?
>>>>> 
>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
>>>>> 
>>>>>> I'm not sure that this is feasible; doing everything at the same time could
>>>>>> mean doing nothing((((
>>>>>> I'm just afraid that the words "we will work on streaming, not on batching;
>>>>>> we have no committer's time for this" mean that yes, we started work on
>>>>>> FLINK-1730, but nobody will commit this work in the end, as already
>>>>>> happened with this ticket.
>>>>>> 
>>>>>> On 23 Feb 2017, 14:26, "Gábor Hermann" <mail@gaborhermann.com> wrote:
>>>>>> 
>>>>>>> @Theodore: Great to hear you think the "batch on streaming" approach is
>>>>>>> possible! Of course, we need to pay attention to all the pitfalls there,
>>>>>>> if we go that way.
>>>>>>> 
>>>>>>> +1 for a design doc!
>>>>>>> 
>>>>>>> I would add that it's possible to make efforts in all three directions
>>>>>>> (i.e. batch, online, batch on streaming) at the same time, although it
>>>>>>> might be worth concentrating on one. E.g. it would not be very useful to
>>>>>>> have the same batch algorithms with both the batch API and the streaming
>>>>>>> API. We can decide later.
>>>>>>> 
>>>>>>> The design doc could be partitioned into these 3 directions, and we can
>>>>>>> collect the pros/cons there too. What do you think?
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Gabor
>>>>>>> 
>>>>>>> 
>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>>>> 
>>>>>>>> Hello all,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> @Gabor, we have discussed the idea of using the streaming API to write all
>>>>>>>> of our ML algorithms with a couple of people offline,
>>>>>>>> and I think it might be possible and is generally worth a shot. The
>>>>>>>> approach we would take would be close to Vowpal Wabbit: not exactly
>>>>>>>> "online", but rather "fast-batch".
>>>>>>>> 
>>>>>>>> There will be problems popping up again, even for very simple algos like
>>>>>>>> online linear regression with SGD [1], but hopefully fixing those will be
>>>>>>>> more aligned with the priorities of the community.
>>>>>>>> 
>>>>>>>> @Katherin, my understanding is that given the limited resources, there is
>>>>>>>> no development effort focused on batch processing right now.
>>>>>>>> 
>>>>>>>> So to summarize, it seems like there are people willing to work on ML on
>>>>>>>> Flink, but nobody is sure how to do it.
>>>>>>>> There are many directions we could take (batch, online, batch on
>>>>>>>> streaming), each with its own merits and downsides.
>>>>>>>> 
>>>>>>>> If you want, we can start a design doc and move the conversation there,
>>>>>>>> come up with a roadmap, and start implementing.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Theodore
>>>>>>>> 
>>>>>>>> [1]
>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
>>>>>>>> 
>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <mail@gaborhermann.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> It's great to see so much activity in this discussion :)
>>>>>>>>> I'll try to add my thoughts.
>>>>>>>>> 
>>>>>>>>> I think building a developer community (Till's 2nd point) can be slightly
>>>>>>>>> separated from what features we should aim for (1st point) and showcasing
>>>>>>>>> (3rd point). Thanks Till for bringing up the ideas for restructuring; I'm
>>>>>>>>> sure we'll find a way to make the development process more dynamic. I'll
>>>>>>>>> try to address the rest here.
>>>>>>>>> 
>>>>>>>>> It's hard to choose directions between streaming and batch ML. As Theo
>>>>>>>>> has indicated, not much online ML is used in production, but Flink
>>>>>>>>> concentrates on streaming, so online ML would be a better fit for Flink.
>>>>>>>>> However, as most of you argued, there's a definite need for batch ML. But
>>>>>>>>> batch ML seems hard to achieve because there are blocking issues with
>>>>>>>>> persisting, iteration paths etc. So it's no good either way.
>>>>>>>>> 
>>>>>>>>> I propose a seemingly crazy solution: what if we developed batch
>>>>>>>>> algorithms with the streaming API as well? The batch API would clearly
>>>>>>>>> seem more suitable for ML algorithms, but there are a lot of benefits of
>>>>>>>>> this approach too, so it's clearly worth considering. Flink also has the
>>>>>>>>> high-level vision of "streaming for everything" that would clearly fit
>>>>>>>>> this case. What do you all think about this? Do you think this solution
>>>>>>>>> would be feasible? I would be happy to make a more elaborate proposal,
>>>>>>>>> but I'll push my main ideas here:
>>>>>>>>> 
>>>>>>>>> 1) Simplifying by using one system
>>>>>>>>> It could simplify the work of both the users and the developers. One
>>>>>>>>> could execute training once, or execute it periodically, e.g. by using
>>>>>>>>> windows. Low-latency serving and training could be done in the same
>>>>>>>>> system. We could implement incremental algorithms, without any side
>>>>>>>>> inputs, for combining online learning (or predictions) with batch
>>>>>>>>> learning. Of course, all the logic describing these must be somehow
>>>>>>>>> implemented (e.g. synchronizing predictions with training), but it should
>>>>>>>>> be easier to do so in one system than by combining e.g. the batch and
>>>>>>>>> streaming API.
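
The "execute training periodically by using windows" idea above could be sketched as follows (plain Python standing in for a tumbling window; all names here are illustrative, not Flink APIs):

```python
# Toy sketch of periodic retraining over windows: collect labeled points per
# tumbling window and retrain a tiny model on each window's contents.

def tumbling_windows(stream, size):
    """Group a finite stream into consecutive windows of `size` elements."""
    for i in range(0, len(stream), size):
        yield stream[i:i + size]

def train_mean_model(window):
    """A 'model' that just predicts the mean label of its training window."""
    labels = [y for _, y in window]
    return sum(labels) / len(labels)

stream = [(x, 2.0 * x) for x in range(1, 9)]  # (feature, label) pairs
models = [train_mean_model(w) for w in tumbling_windows(stream, size=4)]
# One retrained model per window: the label means 5.0 and 13.0.
```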
>>>>>>>>> 
>>>>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>>>>> Despite these benefits, it could seem harder to implement batch ML with
>>>>>>>>> the streaming API, but in my opinion it's not. There are more flexible,
>>>>>>>>> lower-level optimization potentials with the streaming API. Most
>>>>>>>>> distributed ML algorithms use a lower-level model than the batch API
>>>>>>>>> anyway, so sometimes it feels like forcing the algorithm logic into the
>>>>>>>>> training API and tweaking it. Although we could not use batch primitives
>>>>>>>>> like join, we would have that lower-level flexibility instead. E.g. in my
>>>>>>>>> experience with implementing a distributed matrix factorization algorithm
>>>>>>>>> [1], I couldn't do a simple optimization because of the limitations of
>>>>>>>>> the iteration API [2]. Even if we pushed all the development effort into
>>>>>>>>> making the batch API more suitable for ML, there would be things we
>>>>>>>>> couldn't do. E.g. there are approaches for updating a model iteratively
>>>>>>>>> without locks [3,4] (i.e. somewhat asynchronously), and I don't see a
>>>>>>>>> clear way to implement such algorithms with the batch API.
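
The lock-free update approach of [3] ("Hogwild") can be sketched in plain Python with threads sharing a model without any lock (a toy illustration, not a Flink implementation; in CPython the GIL hides most races, and real systems rely on sparse updates to keep conflicts rare):

```python
import random
import threading

random.seed(42)

w = [0.0]  # shared one-dimensional model, updated with no lock
data = [(x, 3.0 * x) for x in (random.uniform(-1.0, 1.0) for _ in range(2000))]

def worker(samples, lr=0.05):
    # Plain SGD on a slice of the data; reads and writes w[0] without locking.
    for x, y in samples:
        err = w[0] * x - y      # prediction error with a possibly stale weight
        w[0] -= lr * err * x    # racy, lock-free update

threads = [threading.Thread(target=worker, args=(data[i::4],)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Despite the races, w[0] ends up close to the true slope 3.0.
```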
>>>>>>>>> 
>>>>>>>>> 3) Streaming community (users and devs) benefit
>>>>>>>>> The Flink streaming community in general would also benefit from this
>>>>>>>>> direction. There are many features needed in the streaming API for ML to
>>>>>>>>> work, but this is also true for the batch API. One really important one
>>>>>>>>> is the loops API (a.k.a. iterative DataStreams) [5]. There has been a lot
>>>>>>>>> of effort (mostly from Paris) to make it mature enough [6]. Kate
>>>>>>>>> mentioned using GPUs, and I'm sure they have uses in streaming generally
>>>>>>>>> [7]. Thus, by improving the streaming API to allow ML algorithms, the
>>>>>>>>> streaming API benefits too (which is important, as it has a lot more
>>>>>>>>> production users than the batch API).
>>>>>>>>> 
>>>>>>>>> 4) Performance can be at least as good
>>>>>>>>> I believe the same performance could be achieved with the streaming API
>>>>>>>>> as with the batch API. The streaming API is much closer to the runtime
>>>>>>>>> than the batch API. For corner cases covered by runtime-layer
>>>>>>>>> optimizations of the batch API, we could find a way to do the same (or a
>>>>>>>>> similar) optimization for the streaming API (see my previous point). Such
>>>>>>>>> a case could be using managed memory (and spilling to disk). There are
>>>>>>>>> also benefits by default, e.g. we would have finer-grained fault
>>>>>>>>> tolerance with the streaming API.
>>>>>>>>> 
>>>>>>>>> 5) We could keep the batch ML API
>>>>>>>>> For the shorter term, we should not throw away all the algorithms
>>>>>>>>> implemented with the batch API. By pushing forward the development of
>>>>>>>>> side inputs, we could make them usable with the streaming API. Then, if
>>>>>>>>> the library gains some popularity, we could replace the batch API
>>>>>>>>> algorithms with streaming ones, to avoid the performance costs of e.g.
>>>>>>>>> not being able to persist.
>>>>>>>>> 
>>>>>>>>> 6) General tools for implementing ML algorithms
>>>>>>>>> Besides implementing algorithms one by one, we could provide more general
>>>>>>>>> tools that make it easier to implement algorithms, e.g. a parameter
>>>>>>>>> server [8,9]. Theo also mentioned in another thread that TensorFlow has a
>>>>>>>>> similar model to Flink streaming; we could look into that too. I think
>>>>>>>>> that when deploying a production ML system, much more configuration and
>>>>>>>>> tweaking often needs to be done than e.g. Spark MLlib allows. Why not
>>>>>>>>> allow that?
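
A toy sketch of the parameter server idea [8] (the interface names are illustrative, not a proposal for a concrete Flink API): workers pull the current parameters, compute a gradient on their data shard, and push the update back:

```python
# Minimal synchronous parameter-server sketch for a 1-d linear model.

class ParameterServer:
    def __init__(self, dim):
        self.params = [0.0] * dim

    def pull(self):
        return list(self.params)  # workers get a snapshot of the parameters

    def push(self, grads, lr=0.1):
        # apply a gradient update sent by a worker
        self.params = [p - lr * g for p, g in zip(self.params, grads)]

def worker_gradient(params, shard):
    """Gradient of squared error for a 1-d linear model on one data shard."""
    g = 0.0
    for x, y in shard:
        g += (params[0] * x - y) * x
    return [g / len(shard)]

ps = ParameterServer(dim=1)
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
for _ in range(200):  # synchronous rounds: pull -> compute gradient -> push
    for shard in shards:
        ps.push(worker_gradient(ps.pull(), shard))
# ps.params[0] converges toward 2.0, the true slope of the data.
```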
>>>>>>>>> 
>>>>>>>>> 7) Showcasing
>>>>>>>>> Showcasing this could be easier. We could say that we're doing batch ML
>>>>>>>>> with a streaming API. That's interesting in its own right. IMHO this
>>>>>>>>> integration is also a more approachable way towards end-to-end ML.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks for reading so far :)
>>>>>>>>> 
>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>>>>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Gabor
>>>>>>>>> 
>>>>>>>>> 
>>>> --
>>>> *Yours faithfully, *
>>>> 
>>>> *Kate Eri.*
>>>> 
> 

