flink-dev mailing list archives

From Theodore Vasiloudis <theodoros.vasilou...@gmail.com>
Subject Re: Opening a discussion on FlinkML
Date Tue, 16 Feb 2016 08:23:36 GMT
So, regarding the original topic question, it seems that most people
prefer option 2, which is to keep the development of FlinkML inside the
project but try to bring in new committers.

A lot of other interesting points have been raised here as well, and if
people are interested in
working on things like the integration of Flink with other ML libraries
there is definitely room for contribution.

As always, the development of the library will be determined by what people
are willing to work on,
so I do invite people to spend some time with the codebase, and open
discussions here and tickets
about things they would like to see in the future.

Regards,
Theodore

On Mon, Feb 15, 2016 at 6:59 PM, Till Rohrmann <trohrmann@apache.org> wrote:

> I agree with Martin that the original topic of this thread was about how to
> keep FlinkML active so that new changes will be promptly merged. What we
> want to implement is then up to the active contributors of FlinkML.
>
> Personally, I would prefer to keep FlinkML as part of Flink's main
> repository and to distribute the workload of reviewing and merging PRs
> across some more shoulders. At the moment, we have only a few committers
> with a sufficiently strong background in ML to review PRs. But in order to
> review PRs one does not necessarily have to be a committer. I think it
> would already help considerably if other contributors could step up and
> give feedback to other contributors. Of course, this would also be a good
> way to become a committer.
>
> Outsourcing FlinkML into a repository which is still governed by the ASF
> would still require a committer to do the merging. Thus, apart from
> decoupling the release cycle, we wouldn't gain much. Considering the number
> of changes we merged over the past months, an overly long release cycle
> was never a problem.
>
> Forking FlinkML completely off would give us the flexibility to commit as
> we like. But it will come at the cost of losing visibility as a project. I
> doubt that many people would become aware of FlinkML if it is some project
> adding ML support to Flink.
>
> Cheers,
> Till
>
> On Sun, Feb 14, 2016 at 12:53 PM, Martin Neumann <mneumann@sics.se> wrote:
>
> > I think the focus of this discussion should be how we proceed, not what to
> > do. The what comes from the committers anyway.
> >
> > There are several people who would like to commit, including people from
> > the Streamline project. Having pull requests that are older than 6 months
> > is not good for any project.
> > The main question is how we can develop the library further with high
> > standards but without creating a bottleneck that holds things back too
> > much.
> >
> > In my opinion it would be best if we find enough resources to keep things
> > inside Flink. However, if we have to depend on people who are
> > already stretched for time, splitting it out might be the better option
> > (path 1 from Theo's original mail).
> >
> > cheers Martin
> >
> >
> >
> >
> > On Fri, Feb 12, 2016 at 3:54 PM, Suneel Marthi <smarthi@apache.org> wrote:
> >
> > > On Fri, Feb 12, 2016 at 9:40 AM, Simone Robutti <
> > > simone.robutti@radicalbit.io> wrote:
> > >
> > > > @Suneel
> > > >
> > > > 1) Totally agree, as I wrote before.
> > > >
> > > > 2) I agree that support for PMML is premature, but we shouldn't
> > > > underestimate the variety and complexity of the uses of ML models in
> > > > industry. The adoption of Flink will hopefully grow and reach less
> > > > innovative environments where Random Forests and SVMs are still the
> > > > main algorithms in use. In these same environments there are legacy
> > > > systems that justify the use of PMML to port models. Still, FlinkML is
> > > > at an early stage, so as you said, it doesn't make sense to spend time
> > > > right now on such a feature.
> > > >
> > >
> > > +1. As I mentioned earlier, the PMML spec only supports classification
> > > and clustering (I last checked this in Aug 2015, and I'm pretty sure it
> > > has not changed since then). Hence 'Yes', it has some limited uses; 'No',
> > > it's too premature to even talk about it given the present state of
> > > FlinkML.
> > >
> > > >
> > > > 3) This would be really interesting. How do you imagine that the
> > > > integration with a distributed processing engine would work?
> > > >
> > >
> > > I am not sure yet; we are still exploring this on the Mahout project as
> > > an addition to Mahout-Samsara. Most of the statistics and probabilistic
> > > modeling would then be supported by Figaro (Bayesian, MCMC, etc.) and
> > > hence can be external to FlinkML.
> > >
> > > Figaro is Scala-based. See https://github.com/p2t2/figaro
> > >
> > > I believe there are a few other similar DSLs out there; I need to dig up
> > > my old emails.
> > >
> > > (Not sure if it's under the ASLv2 license; this needs verification.)
> > >
> > >
> > > >
> > > > 5) Agree on this one too. To my knowledge it would be the best option
> > > > together with SAMOA (for the streaming part).
> > > >
> > >
> > > There's already Flink - Samoa integration in place IIRC.
> > >
> > >
> > > >
> > > > 2016-02-12 15:25 GMT+01:00 Suneel Marthi <smarthi@apache.org>:
> > > >
> > > > > My 2 cents as someone who's done ML over the years, having worked on
> > > > > Oryx 2.0 and Mahout and having used Spark MLlib (read as "had no
> > > > > choice due to strict workplace enforcement"), and who understands
> > > > > their limitations well.
> > > > >
> > > > > 1. FlinkML in its present form seems like "do it like how Spark did
> > > > > it".
> > > > >
> > > > > 2. The recent discussion about PMML support in Flink is, to my mind,
> > > > > a clear example of putting the cart before the horse. Why are we even
> > > > > talking about PMML when there aren't many ML algos in FlinkML?
> > > > >
> > > > > For a really good implementation of PMML and how it's being used
> > > > > (with jPMML), I suggest looking at the Oryx 2.0 project. The PMML
> > > > > implementation in Oryx 2.0 predates Spark and is a clean example of
> > > > > separating PMML from the underlying framework (Spark or Flink).
> > > > >
> > > > > We have had PMML discussions on the Mahout project in the past, but
> > > > > the idea never gained any traction, in large part due to PMML spec
> > > > > limitations (mostly for clustering and classification algorithms) and
> > > > > the lack of adoption within the community.
> > > > >
> > > > > See the discussion here, and specifically Ted Dunning's comment on
> > > > > PMML:
> > > > >
> > > > > http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E
> > > > >
> > > > > Most of the ML in practice (deployed in production) today is
> > > > > Recommenders and Deep Learning, neither of which is supported by the
> > > > > PMML spec.
> > > > >
> > > > > 3. Leveraging a probabilistic programming language like Figaro might
> > > > > be a good way to go (just my thought); that way most of the ML
> > > > > groundwork would be external to Flink.
> > > > >
> > > > > 4. Within the Mahout community, we have been talking about (and are
> > > > > working on) redoing the Samsara distributed linear algebra framework
> > > > > to support Flink (in large part because we realized that Flink is a
> > > > > better platform than the more popular one out there that Slim
> > > > > wouldn't wanna talk about :) ).
> > > > >
> > > > > We should have a release out in the next few weeks (depending on
> > > > > committers' availability). It would be great if FlinkML had something
> > > > > like it.
> > > > >
> > > > > There was a good audience for Sebastian's talk on this subject at
> > > > > #FF15 in October.
> > > > >
> > > > > 5. It's a good idea to add Flink support to H2O, as Slim suggested
> > > > > elsewhere in this thread.
> > > > >
> > > > >
> > > > > Thoughts?
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Feb 12, 2016 at 5:00 AM, Simone Robutti <
> > > > > simone.robutti@radicalbit.io> wrote:
> > > > >
> > > > > > I will give my opinion as a person who has worked with SparkML and
> > > > > > will soon be involved in the development of ML solutions on Flink.
> > > > > >
> > > > > > Recently I tried to track the evolution and development of FlinkML,
> > > > > > and I see a big critical point: FlinkML looks a lot like a
> > > > > > placeholder for commercial purposes, but there's not enough
> > > > > > investment and commitment to achieve a usable product. I did a few
> > > > > > things with FlinkML coming from SparkML and I can say that it's
> > > > > > unsuitable for most of the common use cases covered by SparkML
> > > > > > (which is not a good ML library at all in terms of usability).
> > > > > >
> > > > > > So my question is: do we really need FlinkML? The roadmap looks a
> > > > > > lot like "Spark has SparkML so we MUST have a ML library too". This
> > > > > > could be reasonable if you aim at a fine-tuned library tailored to
> > > > > > the specifics of Flink that are different from Spark. This could be
> > > > > > even better if you developed an implementation of SGD that exploits
> > > > > > the computational model of Flink which, I think, could achieve a lot
> > > > > > more compared to the current implementation. This is a subject that
> > > > > > I want to study better before saying more, but I'm looking at better
> > > > > > parallelization strategies for data and models.
> > > > > >
> > > > > > Going back to FlinkML, do we really need to reimplement the same
> > > > > > workhorse algorithms already implemented in SparkML, H2O, Mahout,
> > > > > > SystemML, Weka, Oryx and other distributed learning libraries? Is it
> > > > > > really useful at this stage? Given the current resources of the
> > > > > > project, wouldn't it be more reasonable to invest time and energy in
> > > > > > integrating more mature libraries (and eventually rich tooling that
> > > > > > would give a big advantage over the other libraries)?
> > > > > >
> > > > > > I would like to comment on your proposals, but my experience in
> > > > > > collaborative open source development is way too limited to form an
> > > > > > interesting opinion. Also, I have had no historical visibility into
> > > > > > the motivations and discussions behind the development of FlinkML,
> > > > > > and I would like pointers to something to read on the shared vision
> > > > > > for this part of the project so that I can join the discussion from
> > > > > > now on.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Simone
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <
> > > > > > theodoros.vasiloudis@gmail.com>:
> > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > > I would like to get a conversation started on how we plan to move
> > > > > > > forward with FlinkML.
> > > > > > >
> > > > > > > Development on the library has been mostly dormant for the past 6
> > > > > > > months, mainly, I believe, because of the lack of available
> > > > > > > committers to review PRs.
> > > > > > >
> > > > > > > Last month we got together with Till and Marton and talked about
> > > > > > > how we could try to solve this and ensure continued development of
> > > > > > > the library.
> > > > > > >
> > > > > > > We see 3 possible paths we could take:
> > > > > > >
> > > > > > >    1. Externalize the library, creating a new repository under the
> > > > > > >    Apache Flink project. This decouples the development of FlinkML
> > > > > > >    from the Flink release cycle, allowing us to move faster and
> > > > > > >    incorporate new features as they become available. As FlinkML is
> > > > > > >    a library under development, tying it to specific versions does
> > > > > > >    not make much sense anyway. The library would depend on the
> > > > > > >    latest snapshot version of Flink. It would then be possible for
> > > > > > >    the Flink distribution to cherry-pick parts of the library to be
> > > > > > >    included with the core distribution.
> > > > > > >    2. Keep the development under the main Flink project but bring
> > > > > > >    in new committers. This would mean that the development remains
> > > > > > >    as is and is tied to core Flink releases, but new work should
> > > > > > >    get merged at much more regular intervals through the help of
> > > > > > >    committers other than Till. Marton Balassi has volunteered for
> > > > > > >    that role and I hope that more might take up that role.
> > > > > > >    3. A third option is to fork FlinkML on a repository on which we
> > > > > > >    are able to commit freely (again through PRs and reviews of
> > > > > > >    course) and merge good parts back into the main repo once in a
> > > > > > >    while. This allows for faster progress and more experimental
> > > > > > >    work but obviously creates fragmentation.
> > > > > > >
> > > > > > >
> > > > > > > I would like to hear your thoughts on these three options, as well
> > > > > > > as discuss other alternatives that could help move FlinkML forward.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Theodore
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
