flink-dev mailing list archives

From Katherin Eri <katherinm...@gmail.com>
Subject Re: [DISCUSS] Flink ML roadmap
Date Tue, 21 Feb 2017 15:48:30 GMT
Till, thank you for your response.
But I need to clarify several points:

1) Yes, batch and batch ML is a field full of alternatives, but in my
opinion that doesn't mean we should neglect the batch side of Flink. You
know, Apache Beam and Apache Mahout both suffer from the lack of a properly
implemented batch backend. DL4J will be able to integrate with Apache
Flink, but that integration will work only on paper and will not be
efficient in production.

With this phrase: "*Unfortunately, all of these problems are far from
trivial to solve and will require quite some changes to Flink's runtime.
Given Flink's current focus on stream processing, I don't see enough
community capacities left to implement these features soon*", did you mean
that Apache Flink will not pay attention to its batch side, or have I
misunderstood you?

2) Yes, reimplementing libraries that the community has already developed
is not a good way to go, but maybe we should turn Flink into an engine that
can easily run ML libraries on top of it: integrate with SystemML, DL4J and
many, many others? But doing this would still require batch computation.

On Tue, 21 Feb 2017 at 18:01, Stavros Kontopoulos <st.kontopoulos@gmail.com> wrote:

> Ok, I see. Suppose we solve all the critical issues, and suppose we don't
> go with the pure online model (although online ML has potential)... should
> we move on with the current ML implementation, which targets batch
> processing (to the best of my knowledge)? The parameter server problem is
> a long-standing one, and many companies out there have started to provide
> their own solutions. That would be very useful, but I see it only as part
> of the solution.
>
> The other thing is that when someone works locally with Flink, they
> currently need to step outside of it to play with other libraries.
> Isn't this important for the product's success?
>
> Regards,
> Stavros
> On Tue, Feb 21, 2017 at 1:04 PM, Theodore Vasiloudis <theodoros.vasiloudis@gmail.com> wrote:
>
> > Thank you all for your thoughts on the matter.
> >
> > Andrea brought up some further engine considerations that we need to
> > address in order to have a competitive ML engine on Flink.
> >
> > I'm happy to see many people willing to contribute to the development of
> > ML on Flink. The way I see it, there needs to be buy-in from the rest of
> > the community for such changes to go through.
> >
> > If you are interested in helping out, the issues mentioned in my
> > previous email and the ones mentioned by Andrea are the most critical
> > ones to tackle, as they require making changes to the core.
> >
> > If you want to take up one of those issues the best way is to start a
> > conversation on the list, and gauge the opinion of the community.
> >
> > Finally, as Stavros mentioned, we need to come up with an updated roadmap
> > for FlinkML that includes these issues.
> >
> > @Andrea, the idea of an online learning library for Flink has been
> > broached before, and this semester I have one Master's student working
> > on exactly that. From my conversations with people in the industry,
> > however, almost nobody uses online learning in production; at best,
> > models are updated every 5 minutes. So the impact would probably not be
> > very large.
> >
> > I would like to bring up again the topic of model serving, which I
> > think fits the Flink use case much better. Developing a system like
> > Clipper [1] on top of Flink could be one of the best ways to use Flink
> > for ML.
> >
> > Regards,
> > Theodore
> >
> > [1]  Clipper: A Low-Latency Online Prediction Serving System -
> > https://arxiv.org/abs/1612.03079
> >
> > On Tue, Feb 21, 2017 at 12:10 AM, Andrea Spina <andrea.spina@radicalbit.io> wrote:
> >
> > > Hi all,
> > >
> > > Thanks Stavros for pushing forward the discussion which I feel really
> > > relevant.
> > >
> > > Since I'm only just becoming active in the community and don't yet
> > > have enough experience and visibility around it, I'd limit myself to
> > > sharing an opinion as a Flink user.
> > >
> > > I've been using Flink for almost a year across two different projects,
> > > and in both cases I've bumped into the question "how do I handle ML
> > > workloads and keep Flink as the main engine?". So the first point that
> > > comes to my mind is: why should I have to adopt an extra system purely
> > > for ML purposes? How great would it be to use the Flink engine as an
> > > ML feature provider and avoid paying the effort of maintaining an
> > > additional engine? This thought also links to @Timur's opinion: I
> > > believe that users would much prefer a unified architecture in this
> > > case. Even if a user wants to use an external tool/library - perhaps
> > > one providing additional language support (e.g. R) - that user should
> > > be able to run it on top of Flink.
> > >
> > > During my work with Flink I have needed to implement some ML
> > > algorithms on both Flink and Spark, and I often struggled with Flink's
> > > performance: namely, I think (in the name of the bigger picture) we
> > > should first focus our effort on solving some well-known Flink
> > > limitations, as @theodore pinpointed. I'd like to highlight [1] and
> > > [2], which I find relevant. If the community decides to go ahead with
> > > FlinkML, I believe fixing the issues described above would be a good
> > > starting point. That would also definitely push forward some important
> > > integrations, such as Apache SystemML.
> > >
> > > Given all these points, I'm increasingly convinced that online
> > > machine learning would be the real final objective and the more
> > > suitable goal, since we're talking about a real-time streaming engine
> > > and - from a very high-level point of view - I believe Flink would fit
> > > this topic in a more genuine way than the batch case. We have a
> > > connector for Apache SAMOA, but IMHO it seems to be at an early stage
> > > of development and not really active. If we want to build something
> > > within Flink instead, we need to speed up the design of some features
> > > (e.g. side inputs [3]).
> > >
> > > I really hope we can define a new roadmap by which we can finally
> > > push this topic forward. I will do my best to help.
> > >
> > > Sincerely,
> > > Andrea
> > >
> > > [1] Add a FlinkTools.persist style method to the Data Set
> > > https://issues.apache.org/jira/browse/FLINK-1730
> > > [2] Only send data to each taskmanager once for broadcasts
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
> > > [3] Side inputs - Evolving or static Filter/Enriching
> > > https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-MKQYN3m4/edit#
> > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Add-Side-Input-Broadcast-Set-For-Streaming-API-td11529.html
> > >
> > >
> > >
> > >
> >
>
-- 

*Yours faithfully,*

*Kate Eri.*
