flink-dev mailing list archives

From Stavros Kontopoulos <st.kontopou...@gmail.com>
Subject Re: [DISCUSS] Flink ML roadmap
Date Fri, 03 Mar 2017 11:24:50 GMT
Interesting, thanks @Roberto. I see that only the TAP Analytics Toolkit
supports streaming. Does anyone know its market share?

Best,
Stavros

On Fri, Mar 3, 2017 at 11:50 AM, Theodore Vasiloudis <
theodoros.vasiloudis@gmail.com> wrote:

> Thank you for the links, Roberto. I did not know that Beam was working on
> an ML abstraction as well. I'm sure we can learn from that.
>
> I'll start another thread today where we can discuss next steps and action
> points, now that we have a few different paths to follow listed in the
> shared doc, since our deadline was today. We welcome further discussions,
> of course.
>
> Regards,
> Theodore
>
> On Thu, Mar 2, 2017 at 10:52 AM, Roberto Bentivoglio <
> roberto.bentivoglio@radicalbit.io> wrote:
>
> > Hi All,
> >
> > First of all I'd like to introduce myself: my name is Roberto Bentivoglio
> > and I'm currently working for Radicalbit, like Andrea Spina (who already
> > wrote on this thread).
> > I didn't have the chance to directly contribute to Flink up to now, but
> > some colleagues of mine have been doing so for at least a year (they have
> > also contributed to the machine learning library).
> >
> > I hope I'm not jumping into the discussion too late; it's really
> > interesting, and the analysis document depicts the currently available
> > scenarios really well. Many thanks for your effort!
> >
> > If I can add my two cents to the discussion, I'd like to add the
> > following:
> >  - it's clear that the Flink community is currently more deeply focused
> > on streaming features than on batch features. For this reason I think
> > that implementing "Offline learning with Streaming API" is really a great
> > idea.
> >  - I think that the "Online learning" option is really a good fit for
> > Flink, but maybe we could initially give a higher priority to the
> > "Offline learning with Streaming API" option. However, I think the online
> > option will be the main goal for the mid/long term.
> >  - we implemented a library based on jpmml-evaluator [1] and Flink called
> > "flink-jpmml". Using this library you can train models on external
> > systems and, after exporting them in the PMML standard format, use those
> > models to run evaluations on top of the DataStream API. We haven't open
> > sourced this library yet, but we're planning to do so in the next weeks;
> > we'd like to complete the documentation and the final code reviews before
> > sharing it. I hope it will be helpful for the community to enhance the ML
> > support in Flink.
> >  - I'd also like to mention that the Apache Beam community is thinking
> > about an ML DSL. There is a design document and a couple of Jira tasks
> > for that [2][3].
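The flink-jpmml idea above (train the model elsewhere, export it, and only evaluate it inside the stream) can be sketched independently of Flink. Below is a minimal, hedged Python illustration: a plain dict stands in for the parsed PMML document and a list for the DataStream; none of these names are the flink-jpmml or jpmml-evaluator API.

```python
# Sketch of the "train offline, evaluate on the stream" pattern behind
# flink-jpmml: the model is produced elsewhere, exported in a portable
# format, and the streaming job only scores incoming events.

# Stand-in for a parsed PMML regression model: intercept + coefficients.
EXPORTED_MODEL = {"intercept": 0.5, "coefficients": {"x1": 2.0, "x2": -1.0}}

def evaluate(model, event):
    """Score a single event (a dict of feature name -> value)."""
    score = model["intercept"]
    for name, coef in model["coefficients"].items():
        score += coef * event.get(name, 0.0)
    return score

def score_stream(model, events):
    """The streaming 'map' step: attach a prediction to every event."""
    for event in events:
        yield {**event, "prediction": evaluate(model, event)}

stream = [{"x1": 1.0, "x2": 2.0}, {"x1": 0.0, "x2": 1.0}]
scored = list(score_stream(EXPORTED_MODEL, stream))
```

The point of the pattern is that the streaming job never trains: it only applies a model that can be swapped by re-exporting it.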
> >
> > At Radicalbit we're really keen to focus our efforts on improving the ML
> > support in Flink, and our team will contribute to this effort on a
> > regular basis.
> >
> > Looking forward to working with you!
> >
> > Many thanks,
> > Roberto
> >
> > [1] - https://github.com/jpmml/jpmml-evaluator
> > [2] - https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA
> > [3] - https://issues.apache.org/jira/browse/BEAM-303
> >
> > On 28 February 2017 at 19:35, Gábor Hermann <mail@gaborhermann.com> wrote:
> >
> > > Hi Philipp,
> > >
> > > It's great to hear you are interested in Flink ML!
> > >
> > > Based on your description, your prototype seems like an interesting
> > > approach for combining online+offline learning. If you're interested, we
> > > might find a way to integrate your work, or at least your ideas, into
> > > Flink ML if we decide on a direction that fits your approach. I think
> > > your work could be relevant for almost all the directions listed there
> > > (if I understand correctly, you'd even like to serve predictions on
> > > unlabeled data).
> > >
> > > Feel free to join the discussion in the docs you've mentioned :)
> > >
> > > Cheers,
> > > Gabor
> > >
> > >
> > > On 2017-02-27 18:39, Philipp Zehnder wrote:
> > >
> > >> Hello all,
> > >>
> > >> I'm new to this mailing list and I wanted to introduce myself. My name
> > >> is Philipp Zehnder and I'm a master's student in Computer Science at
> > >> the Karlsruhe Institute of Technology in Germany, currently writing my
> > >> master's thesis, whose main goal is to integrate reusable machine
> > >> learning components into a stream processing network. One part of my
> > >> thesis is to create an API for distributed online machine learning.
> > >>
> > >> I saw that there are some recent discussions about how to continue the
> > >> development of Flink ML [1], and I want to share some of my experiences
> > >> and maybe get some feedback from the community on my ideas.
> > >>
> > >> As I am new to open source projects, I hope this is the right place for
> > >> this.
> > >>
> > >> In the beginning, I had a look at different existing frameworks, for
> > >> example Apache SAMOA, which is great and has a lot of useful resources.
> > >> However, as Flink is currently focusing on streaming, from my point of
> > >> view it makes sense to also have a streaming machine learning API as
> > >> part of the Flink ecosystem.
> > >>
> > >> I'm currently working on building a prototype for a distributed
> > >> streaming machine learning library based on Flink that can be used for
> > >> online and "classical" offline learning.
> > >>
> > >> The machine learning algorithm takes labeled and non-labeled data. On
> > >> a labeled data point, a prediction is performed first and then the
> > >> label is used to train the model. On a non-labeled data point, just a
> > >> prediction is performed. The main difference between the online and
> > >> offline algorithms is that in the offline case the labeled data must be
> > >> handed to the model before the unlabeled data. In the online case, it
> > >> is still possible to process labeled data at a later point to update
> > >> the model. The advantage of this approach is that batch algorithms can
> > >> be applied to streaming data, and online algorithms can be supported as
> > >> well.
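The test-then-train flow described above (predict on a labeled point first, then use its label to update the model; only predict on unlabeled points) can be sketched as a minimal online linear regression. This is an illustrative Python sketch of the idea, not code or an API from the prototype:

```python
# Illustrative test-then-train (prequential) learner: on a labeled
# element, predict first, then take one SGD step with the revealed label;
# on an unlabeled element, only predict.

class OnlineLinearModel:
    def __init__(self, lr=0.01):
        self.w, self.b, self.lr = 0.0, 0.0, lr

    def predict(self, x):
        return self.w * x + self.b

    def process(self, x, y=None):
        """Handle one stream element; y is None for unlabeled data."""
        pred = self.predict(x)
        if y is not None:                 # labeled: train after predicting
            err = pred - y                # squared-error gradient pieces
            self.w -= self.lr * err * x
            self.b -= self.lr * err
        return pred

model = OnlineLinearModel()
for i in range(2000):                     # labeled stream drawn from y = 3x + 1
    x = float(i % 10)
    model.process(x, y=3.0 * x + 1.0)
unlabeled_prediction = model.process(4.0)  # unlabeled: predict only
```

Because prediction always happens before the label is used, the same loop serves both the "offline first, then unlabeled" and the fully online ordering.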
> > >>
> > >> One difference from batch learning is the transformers that are used
> > >> to preprocess the data. For example, a simple mean subtraction must be
> > >> implemented with a rolling mean, because we can't calculate the mean
> > >> over all the data; the Flink Streaming API is perfect for that. It
> > >> would be useful for users to have an extensible toolbox of
> > >> transformers.
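A rolling mean subtraction of this kind can be maintained incrementally, one element at a time. A small illustrative sketch (not an actual transformer API from the prototype or Flink):

```python
# Streaming mean subtraction: the global mean is unavailable, so the
# transformer maintains a running mean and centers each element against
# the mean observed so far.

class RollingMeanSubtraction:
    def __init__(self):
        self.count, self.mean = 0, 0.0

    def transform(self, x):
        self.count += 1
        self.mean += (x - self.mean) / self.count  # incremental mean update
        return x - self.mean

t = RollingMeanSubtraction()
centered = [t.transform(x) for x in [2.0, 4.0, 6.0]]
```

The incremental update needs only O(1) state per transformer instance, which is what makes it suitable as keyed operator state in a stream.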
> > >>
> > >> Another difference is the evaluation of the models. We don't have a
> > >> single value that determines the model quality; in streaming scenarios
> > >> this value evolves over time as the model sees more labeled data.
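An evolving quality value like this is usually tracked prequentially: every labeled element updates a running, optionally decayed, error. A hedged sketch of that bookkeeping (the class name and decay scheme are illustrative choices, not the thesis design):

```python
# An evolving model-quality value: instead of one final score, the metric
# is updated every time a labeled element arrives, with an optional
# fading factor so that recent errors weigh more than old ones.

class PrequentialError:
    def __init__(self, decay=0.99):
        self.decay = decay       # fading factor: 1.0 = plain running mean
        self.weighted_err = 0.0
        self.weight = 0.0

    def update(self, prediction, label):
        err = (prediction - label) ** 2
        self.weighted_err = self.decay * self.weighted_err + err
        self.weight = self.decay * self.weight + 1.0
        return self.weighted_err / self.weight  # current quality estimate

m = PrequentialError(decay=1.0)
history = [m.update(p, y) for p, y in [(1.0, 1.0), (2.0, 1.0), (1.0, 1.0)]]
```

With `decay=1.0` this reduces to the cumulative mean squared error; a decay below 1 makes the value track concept drift.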
> > >>
> > >> However, transformation and evaluation again work similarly in both
> > >> online learning and offline learning.
> > >>
> > >> I also liked the discussion in [2], and I think that the competition
> > >> in the batch learning field is hard and there are already a lot of
> > >> great projects. I think it is true that in most real-world problems it
> > >> is not necessary to update the model immediately, but there are a lot
> > >> of use cases for machine learning on streams, and for them it would be
> > >> nice to have a native approach.
> > >>
> > >> A streaming machine learning API for Flink would fit very well, and I
> > >> would also be willing to contribute to the future development of the
> > >> Flink ML library.
> > >>
> > >>
> > >>
> > >> Best regards,
> > >>
> > >> Philipp
> > >>
> > >> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-td16040.html
> > >> [2] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2
> > >>
> > >>
> > >> On 23.02.2017 at 15:48, Gábor Hermann <mail@gaborhermann.com> wrote:
> > >>>
> > >>> Okay, I've created a skeleton of the design doc for choosing a
> > >>> direction:
> > >>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing
> > >>>
> > >>> Much of the pros/cons have already been discussed here, so I'll try to
> > >>> put all the arguments mentioned in this thread there. Feel free to put
> > >>> more there :)
> > >>>
> > >>> @Stavros: I agree we should take action fast. What about collecting
> > >>> our thoughts in the doc by around Tuesday next week (28 February)?
> > >>> Then decide on the direction and design a roadmap by around Friday
> > >>> (3 March)? Is that feasible, or should it take more time?
> > >>>
> > >>> I think it will be necessary to have a shepherd, or even better a
> > >>> committer, involved in at least reviewing and accepting the roadmap.
> > >>> It would be best if a committer coordinated all this.
> > >>> @Theodore: Would you like to do the coordination?
> > >>>
> > >>> Regarding the use cases: I've seen some abstracts of talks at SF Flink
> > >>> Forward [1] that seem promising. There are companies already using
> > >>> Flink for ML [2,3,4,5].
> > >>>
> > >>> [1] http://sf.flink-forward.org/program/sessions/
> > >>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
> > >>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
> > >>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-learning-on-flink/
> > >>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learning-scenarios-with-flink/
> > >>>
> > >>> Cheers,
> > >>> Gabor
> > >>>
> > >>>
> > >>> On 2017-02-23 15:19, Katherin Eri wrote:
> > >>>
> > >>>> I have already asked some teams for useful cases, but all of them
> > >>>> need time to think.
> > >>>> During the analysis something will finally arise.
> > >>>> Maybe we can ask Flink's partners for cases? Data Artisans got the
> > >>>> results of a customer survey [1]; better ML support is wanted, so we
> > >>>> could ask what exactly is necessary.
> > >>>>
> > >>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
> > >>>>
> > >>>> On 23 Feb 2017 at 4:32 PM, "Stavros Kontopoulos" <
> > >>>> st.kontopoulos@gmail.com> wrote:
> > >>>>
> > >>>>> +100 for a design doc.
> > >>>>>
> > >>>>> Could we also set a roadmap after some time-boxed investigation,
> > >>>>> captured in that document? We need action.
> > >>>>>
> > >>>>> Looking forward to working on this (whatever that might be) ;) Also,
> > >>>>> is there any data supporting one direction or the other from a
> > >>>>> customer perspective? It would help to make more informed decisions.
> > >>>>>
> > >>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <katherinmail@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Yes, ok.
> > >>>>>> Let's start a design document and write down there the already
> > >>>>>> mentioned ideas about the parameter server, Clipper, and others. It
> > >>>>>> would be nice if we also mapped these approaches to cases.
> > >>>>>> We will work on it collaboratively, topic by topic; maybe we will
> > >>>>>> finally form some picture that can be agreed with the committers.
> > >>>>>> @Gabor, could you please start such a shared doc, as you have
> > >>>>>> already proposed several ideas?
> > >>>>>>
> > >>>>>> Thu, 23 Feb 2017, 15:06, Gábor Hermann <mail@gaborhermann.com>:
> > >>>>>>
> > >>>>>>> I agree that it's better to go in one direction first, but I think
> > >>>>>>> online and offline with the streaming API can go somewhat in
> > >>>>>>> parallel later. We could set a short-term goal, concentrate
> > >>>>>>> initially on one direction, and showcase that direction (e.g. in a
> > >>>>>>> blog post). But first, we should list the pros/cons in a design doc
> > >>>>>>> as a minimum. Then make a decision on which direction to go. Would
> > >>>>>>> that be feasible?
> > >>>>>>>
> > >>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
> > >>>>>>>
> > >>>>>>>> I'm not sure that this is feasible; doing all at the same time
> > >>>>>>>> could mean doing nothing((((
> > >>>>>>>> I'm just afraid that the words "we will work on streaming, not on
> > >>>>>>>> batching; we have no committer's time for this" mean that yes, we
> > >>>>>>>> started work on FLINK-1730, but nobody will commit this work in
> > >>>>>>>> the end, as it already happened with this ticket.
> > >>>>>>>>
> > >>>>>>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <mail@gaborhermann.com>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> @Theodore: Great to hear you think the "batch on streaming"
> > >>>>>>>>> approach is possible! Of course, we need to pay attention to all
> > >>>>>>>>> the pitfalls there if we go that way.
> > >>>>>>>>>
> > >>>>>>>>> +1 for a design doc!
> > >>>>>>>>>
> > >>>>>>>>> I would add that it's possible to make efforts in all three
> > >>>>>>>>> directions (i.e. batch, online, batch on streaming) at the same
> > >>>>>>>>> time. Although, it might be worth concentrating on one. E.g. it
> > >>>>>>>>> would not be so useful to have the same batch algorithms with
> > >>>>>>>>> both the batch API and the streaming API. We can decide later.
> > >>>>>>>>>
> > >>>>>>>>> The design doc could be partitioned into these 3 directions, and
> > >>>>>>>>> we can collect the pros/cons there too. What do you think?
> > >>>>>>>>>
> > >>>>>>>>> Cheers,
> > >>>>>>>>> Gabor
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hello all,
> > >>>>>>>>>>
> > >>>>>>>>>> @Gabor, we have discussed the idea of using the streaming API to
> > >>>>>>>>>> write all of our ML algorithms with a couple of people offline,
> > >>>>>>>>>> and I think it might be possible and is generally worth a shot.
> > >>>>>>>>>> The approach we would take would be close to Vowpal Wabbit: not
> > >>>>>>>>>> exactly "online", but rather "fast-batch".
> > >>>>>>>>>>
> > >>>>>>>>>> There will be problems popping up again, even for very simple
> > >>>>>>>>>> algos like online linear regression with SGD [1], but hopefully
> > >>>>>>>>>> fixing those will be more aligned with the priorities of the
> > >>>>>>>>>> community.
> > >>>>>>>>>>
> > >>>>>>>>>> @Katherin, my understanding is that given the limited resources,
> > >>>>>>>>>> there is no development effort focused on batch processing right
> > >>>>>>>>>> now.
> > >>>>>>>>>>
> > >>>>>>>>>> So to summarize, it seems like there are people willing to work
> > >>>>>>>>>> on ML on Flink, but nobody is sure how to do it.
> > >>>>>>>>>> There are many directions we could take (batch, online, batch on
> > >>>>>>>>>> streaming), each with its own merits and downsides.
> > >>>>>>>>>>
> > >>>>>>>>>> If you want, we can start a design doc and move the conversation
> > >>>>>>>>>> there, come up with a roadmap and start implementing.
> > >>>>>>>>>>
> > >>>>>>>>>> Regards,
> > >>>>>>>>>> Theodore
> > >>>>>>>>>>
> > >>>>>>>>>> [1]
> > >>>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
> > >>>>>>>>>>
> > >>>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
> > >>>>>>>>>> mail@gaborhermann.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> It's great to see so much activity in this discussion :)
> > >>>>>>>>>>> I'll try to add my thoughts.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I think building a developer community (Till's 2nd point) can
> > >>>>>>>>>>> be slightly separated from what features we should aim for (1st
> > >>>>>>>>>>> point) and showcasing (3rd point). Thanks Till for bringing up
> > >>>>>>>>>>> the ideas for restructuring, I'm sure we'll find a way to make
> > >>>>>>>>>>> the development process more dynamic. I'll try to address the
> > >>>>>>>>>>> rest here.
> > >>>>>>>>>>>
> > >>>>>>>>>>> It's hard to choose directions between streaming and batch ML.
> > >>>>>>>>>>> As Theo has indicated, not much online ML is used in
> > >>>>>>>>>>> production, but Flink concentrates on streaming, so online ML
> > >>>>>>>>>>> would be a better fit for Flink. However, as most of you
> > >>>>>>>>>>> argued, there's a definite need for batch ML. But batch ML
> > >>>>>>>>>>> seems hard to achieve because there are blocking issues with
> > >>>>>>>>>>> persisting, iteration paths etc. So it's no good either way.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I propose a seemingly crazy solution: what if we developed
> > >>>>>>>>>>> batch algorithms also with the streaming API? The batch API
> > >>>>>>>>>>> would clearly seem more suitable for ML algorithms, but there
> > >>>>>>>>>>> are a lot of benefits to this approach too, so it's clearly
> > >>>>>>>>>>> worth considering. Flink also has the high-level vision of
> > >>>>>>>>>>> "streaming for everything" that would clearly fit this case.
> > >>>>>>>>>>> What do you all think about this? Do you think this solution
> > >>>>>>>>>>> would be feasible? I would be happy to make a more elaborate
> > >>>>>>>>>>> proposal, but I'll push my main ideas here:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1) Simplifying by using one system
> > >>>>>>>>>>> It could simplify the work of both the users and the
> > >>>>>>>>>>> developers. One could execute training once, or execute it
> > >>>>>>>>>>> periodically, e.g. by using windows. Low-latency serving and
> > >>>>>>>>>>> training could be done in the same system. We could implement
> > >>>>>>>>>>> incremental algorithms, without any side inputs, for combining
> > >>>>>>>>>>> online learning (or predictions) with batch learning. Of
> > >>>>>>>>>>> course, all the logic describing these must be somehow
> > >>>>>>>>>>> implemented (e.g. synchronizing predictions with training), but
> > >>>>>>>>>>> it should be easier to do so in one system than by combining
> > >>>>>>>>>>> e.g. the batch and streaming API.
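The "train periodically with windows, serve in the same system" idea can be miniaturized as follows. This is a deliberately tiny Python sketch: the count-based window and the mean-fitting "trainer" are stand-ins chosen for brevity, not Flink APIs or a proposed design.

```python
# Serving and periodic (re)training in one stream: labeled elements are
# buffered into a count-based tumbling window; when the window fires, a
# freshly trained batch model replaces the one used for serving.

def fit_mean(batch):
    """'Batch training' stand-in: predict the mean label of the window."""
    return sum(batch) / len(batch)

def serve_and_retrain(labeled_stream, window_size=3):
    model, window, outputs = 0.0, [], []
    for label in labeled_stream:
        outputs.append(model)            # serve with the current model
        window.append(label)
        if len(window) == window_size:   # tumbling window fires
            model = fit_mean(window)     # periodic batch retraining
            window = []
    return outputs

preds = serve_and_retrain([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
```

Each element is served with the model from the last completed window, which is exactly the "offline learning with streaming API" behavior under discussion.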
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2) Batch ML with the streaming API is not harder
> > >>>>>>>>>>> Despite these benefits, it could seem harder to implement batch
> > >>>>>>>>>>> ML with the streaming API, but in my opinion it's not. There
> > >>>>>>>>>>> are more flexible, lower-level optimization potentials with the
> > >>>>>>>>>>> streaming API. Most distributed ML algorithms use a lower-level
> > >>>>>>>>>>> model than the batch API anyway, so sometimes it feels like
> > >>>>>>>>>>> forcing the algorithm logic into the training API and tweaking
> > >>>>>>>>>>> it. Although we could not use the batch primitives like join,
> > >>>>>>>>>>> we would have the lower-level streaming primitives. E.g. in my
> > >>>>>>>>>>> experience with implementing a distributed matrix factorization
> > >>>>>>>>>>> algorithm [1], I couldn't do a simple optimization because of
> > >>>>>>>>>>> the limitations of the iteration API [2]. Even if we pushed all
> > >>>>>>>>>>> the development effort into making the batch API more suitable
> > >>>>>>>>>>> for ML, there would be things we couldn't do. E.g. there are
> > >>>>>>>>>>> approaches for updating a model iteratively without locks [3,4]
> > >>>>>>>>>>> (i.e. somewhat asynchronously), and I don't see a clear way to
> > >>>>>>>>>>> implement such algorithms with the batch API.
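The lock-free updates cited in [3] (the Hogwild approach) amount to several workers applying SGD steps to shared parameters without synchronization, tolerating the occasional lost write instead of preventing it. A toy single-weight Python sketch, where threads stand in for parallel workers (not a Flink implementation):

```python
# Toy Hogwild-style SGD: four threads update one shared weight with no
# locking, fitting y = 2x. Lost updates are tolerated, not prevented.

import random
import threading

w = [0.0]  # shared model: one weight, target value 2.0

def worker(steps, lr=0.01):
    rng = random.Random(0)
    for _ in range(steps):
        x = rng.uniform(0.5, 1.5)
        err = w[0] * x - 2.0 * x    # residual of squared-error loss
        w[0] -= lr * err * x        # unsynchronized gradient step

threads = [threading.Thread(target=worker, args=(2000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every applied step contracts the weight toward 2.0, so the result is accurate even when concurrent writes occasionally clobber each other; that tolerance is what makes the scheme hard to express in a synchronous batch iteration.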
> > >>>>>>>>>>>
> > >>>>>>>>>>> 3) Streaming community (users and devs) benefit
> > >>>>>>>>>>> The Flink streaming community in general would also benefit
> > >>>>>>>>>>> from this direction. There are many features needed in the
> > >>>>>>>>>>> streaming API for ML to work, but this is also true for the
> > >>>>>>>>>>> batch API. One really important one is the loops API (a.k.a.
> > >>>>>>>>>>> iterative DataStreams) [5]. There has been a lot of effort
> > >>>>>>>>>>> (mostly from Paris) to make it mature enough [6]. Kate
> > >>>>>>>>>>> mentioned using GPUs, and I'm sure they have uses in streaming
> > >>>>>>>>>>> generally [7]. Thus, by improving the streaming API to allow ML
> > >>>>>>>>>>> algorithms, the streaming API would benefit too (which is
> > >>>>>>>>>>> important, as it has a lot more production users than the batch
> > >>>>>>>>>>> API).
> > >>>>>>>>>>>
> > >>>>>>>>>>> 4) Performance can be at least as good
> > >>>>>>>>>>> I believe the same performance could be achieved with the
> > >>>>>>>>>>> streaming API as with the batch API. The streaming API is much
> > >>>>>>>>>>> closer to the runtime than the batch API. For corner cases
> > >>>>>>>>>>> covered by runtime-layer optimizations of the batch API, we
> > >>>>>>>>>>> could find a way to do the same (or a similar) optimization for
> > >>>>>>>>>>> the streaming API (see my previous point). Such a case could be
> > >>>>>>>>>>> using managed memory (and spilling to disk). There are also
> > >>>>>>>>>>> benefits by default, e.g. we would have finer-grained fault
> > >>>>>>>>>>> tolerance with the streaming API.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 5) We could keep the batch ML API
> > >>>>>>>>>>> For the shorter term, we should not throw away all the
> > >>>>>>>>>>> algorithms implemented with the batch API. By pushing forward
> > >>>>>>>>>>> the development of side inputs we could make them usable with
> > >>>>>>>>>>> the streaming API. Then, if the library gains some popularity,
> > >>>>>>>>>>> we could replace the algorithms in the batch API with streaming
> > >>>>>>>>>>> ones, to avoid the performance costs of e.g. not being able to
> > >>>>>>>>>>> persist.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 6) General tools for implementing ML algorithms
> > >>>>>>>>>>> Besides implementing algorithms one by one, we could provide
> > >>>>>>>>>>> more general tools for making it easier to implement
> > >>>>>>>>>>> algorithms, e.g. a parameter server [8,9]. Theo also mentioned
> > >>>>>>>>>>> in another thread that TensorFlow has a similar model to Flink
> > >>>>>>>>>>> streaming; we could look into that too. I think that often,
> > >>>>>>>>>>> when deploying a production ML system, much more configuration
> > >>>>>>>>>>> and tweaking should be done than e.g. Spark MLlib allows. Why
> > >>>>>>>>>>> not allow that?
allow that?
> > >>>>>>>>>>>
> > >>>>>>>>>>> 7) Showcasing
> > >>>>>>>>>>> Showcasing this could be easier. We could say that we're doing
> > >>>>>>>>>>> batch ML with a streaming API. That's interesting in its own
> > >>>>>>>>>>> right. IMHO this integration is also a more approachable way
> > >>>>>>>>>>> towards end-to-end ML.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for reading so far :)
> > >>>>>>>>>>>
> > >>>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
> > >>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
> > >>>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
> > >>>>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
> > >>>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
> > >>>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
> > >>>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
> > >>>>>>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
> > >>>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
> > >>>>>>>>>>>
> > >>>>>>>>>>> Cheers,
> > >>>>>>>>>>> Gabor
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>> --
> > >>>>>> Yours faithfully,
> > >>>>>> Kate Eri.
> > >>
> > >
> >
> >
> > --
> > Roberto Bentivoglio
> > CTO
> > e. roberto.bentivoglio@radicalbit.io
> > Radicalbit S.r.l.
> > radicalbit.io
> >
>
