flink-dev mailing list archives

From Suneel Marthi <smar...@apache.org>
Subject Re: Opening a discussion on FlinkML
Date Fri, 12 Feb 2016 14:54:18 GMT
On Fri, Feb 12, 2016 at 9:40 AM, Simone Robutti <
simone.robutti@radicalbit.io> wrote:

> @Suneel
>
> 1) Totally agree, as I wrote before.
>
> 2) I agree that support for PMML is premature, but we shouldn't underestimate
> the variety and complexity of the uses of ML models in the industry. The
> adoption of Flink will hopefully grow and reach less innovative environments
> where Random Forests and SVMs are still the main algorithms in use. In
> those environments there are legacy systems that justify the use of PMML to
> port models. Still, FlinkML is at an early stage so, as you said, it
> doesn't make sense to spend time right now on such a feature.
>

+1. As I mentioned earlier, the PMML spec only supports classification and
clustering (I last checked this in Aug 2015 and am pretty sure it has not
changed since then). So yes, it has some limited uses, but no, it's too
premature to even talk about it given the present state of FlinkML.

>
> 3) This would be really interesting. How do you imagine that the integration
> with a distributed processing engine would work?
>

I am not sure yet; we are still exploring this in the Mahout project as an
addition to Mahout-Samsara. Most of the statistics and probabilistic modeling
would then be supported by Figaro (Bayesian, MCMC, etc.) and hence can be
external to FlinkML.

Figaro is Scala based. See https://github.com/p2t2/figaro

I believe there are a few other similar DSLs out there; I need to dig up my
old emails.

(Not sure if it's under the ASLv2 license; this needs verification.)


>
> 5) Agree on this one too. To my knowledge it would be the best option
> together with SAMOA (for the streaming part).
>

There's already a Flink-SAMOA integration in place, IIRC.


>
> 2016-02-12 15:25 GMT+01:00 Suneel Marthi <smarthi@apache.org>:
>
> > My 2 cents as someone who's done ML over the years - having worked on
> > Oryx 2.0 and Mahout and having used Spark MLlib (read as "had no choice
> > due to strict workplace enforcement") and who understands their
> > limitations well.
> >
> > 1. FlinkML in its present form seems like "do it like how Spark did it".
> >
> > 2. The recent discussion about PMML support in Flink is, to my mind, a
> > clear example of putting the cart before the horse. Why are we even
> > talking PMML when there aren't many ML algos in FlinkML?
> >
> > For a really good implementation of PMML and how it's being used (with
> > JPMML), I suggest looking at the Oryx 2.0 project. The PMML implementation
> > in Oryx 2.0 predates Spark and is a clean example of separating PMML from
> > the underlying framework (Spark or Flink).
> >
> > We have had PMML discussions on the Mahout project in the past, but the
> > idea never gained any traction in large part due to PMML spec limitations
> > (mostly for clustering and classification algorithms) and the lack of
> > adoption within the community.
> >
> > See the discussion here, and specifically Ted Dunning's comment on PMML:
> >
> > http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E
> >
> > Most of the ML in practice (deployed in production) today is Recommenders
> > and Deep Learning, neither of which is supported by the PMML spec.
> >
> > 3. Leveraging a probabilistic programming language like Figaro might be a
> > good way to go (just my thought) - that way most of the ML groundwork
> > would be external to Flink.
> >
> > 4. Within the Mahout community, we had been talking (and are working) on
> > redoing the Samsara Distributed linear algebra framework to support Flink
> > (in large part we realized that Flink is a better platform than the more
> > popular one out there that Slim wouldn't wanna talk about :) ).
> >
> > We should have a release out in the next few weeks (depending on
> > committers' availability). It would be great if FlinkML had something
> > like it.
> >
> > There was a good audience to Sebastian's talk on this subject at #FF15 in
> > October.
> >
> > 5. It's a good idea to add Flink support to H2O as Slim had suggested
> > elsewhere in this thread.
> >
> >
> > Thoughts?
> >
> >
> >
> > On Fri, Feb 12, 2016 at 5:00 AM, Simone Robutti <
> > simone.robutti@radicalbit.io> wrote:
> >
> > > I will give my opinion as a person who has worked with SparkML and will
> > > soon be involved in the development of ML solutions on Flink.
> > >
> > > Lately I have tried to track the evolution and development of FlinkML,
> > > and I see one big critical point: FlinkML looks a lot like a placeholder
> > > for commercial purposes, but there's not enough investment and commitment
> > > to achieve a usable product. I did a few things with FlinkML coming from
> > > SparkML and I can say that it's unsuitable for most of the common use
> > > cases covered by SparkML (which is not a good ML library at all in terms
> > > of usability).
> > >
> > > So my question is: do we really need FlinkML? The roadmap looks a lot
> > > like "Spark has SparkML so we MUST have an ML library too". This could be
> > > reasonable if you aim at a fine-tuned library tailored to the specifics
> > > of Flink that differ from Spark. It could be even better if you
> > > developed an implementation of SGD that exploits the computational model
> > > of Flink, which, I think, could achieve a lot more than the current
> > > implementation. This is a subject that I want to study further before
> > > saying more, but I'm looking at better parallelization strategies for
> > > data and models.
> > >
> > > Going back to FlinkML, do we really need to reimplement the same
> > > workhorse algorithms already implemented in SparkML, H2O, Mahout,
> > > SystemML, Weka, Oryx and other distributed learning libraries? Is it
> > > really useful at this stage? Given the current resources of the project,
> > > wouldn't it be more reasonable to invest time and energy in integrating
> > > more mature libraries (and eventually rich tooling that would give a big
> > > advantage over the other libraries)?
> > >
> > > I would like to comment on your proposals but my experience in
> > > collaborative open source development is way too limited to form an
> > > interesting opinion. Also, I have had no visibility into the history of
> > > the motivations and discussions behind the development of FlinkML, and I
> > > would like pointers to something to read on the shared vision for this
> > > part of the project so that I can join the discussion from now on.
> > >
> > > Thanks,
> > >
> > > Simone
> > >
> > >
> > >
> > > 2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <
> > > theodoros.vasiloudis@gmail.com>:
> > >
> > > > Hello all,
> > > >
> > > > I would like to get a conversation started on how we plan to move
> > > > forward with FlinkML.
> > > >
> > > > Development on the library has been mostly dormant for the past 6
> > > > months, mainly, I believe, because of the lack of available committers
> > > > to review PRs.
> > > >
> > > > Last month we got together with Till and Marton and talked about how
> > > > we could try to solve this and ensure continued development of the
> > > > library.
> > > >
> > > > We see 3 possible paths we could take:
> > > >
> > > >    1. Externalize the library, creating a new repository under the
> > > >    Apache Flink project. This decouples the development of FlinkML from
> > > >    the Flink release cycle, allowing us to move faster and incorporate
> > > >    new features as they become available. As FlinkML is a library under
> > > >    development, tying it to specific versions does not make much sense
> > > >    anyway. The library would depend on the latest snapshot version of
> > > >    Flink. It would then be possible for the Flink distribution to
> > > >    cherry-pick parts of the library to be included with the core
> > > >    distribution.
> > > >    2. Keep the development under the main Flink project but bring in new
> > > >    committers. This would mean that the development remains as is and is
> > > >    tied to core Flink releases, but new work should get merged at much
> > > >    more regular intervals through the help of committers other than
> > > >    Till. Marton Balassi has volunteered for that role and I hope that
> > > >    more might take it up.
> > > >    3. A third option is to fork FlinkML to a repository to which we are
> > > >    able to commit freely (again through PRs and reviews, of course) and
> > > >    merge good parts back into the main repo once in a while. This allows
> > > >    for faster progress and more experimental work but obviously creates
> > > >    fragmentation.
> > > >
> > > >
> > > > I would like to hear your thoughts on these three options, as well as
> > > > discuss other alternatives that could help move FlinkML forward.
> > > >
> > > > Cheers,
> > > > Theodore
> > > >
> > >
> >
>
