spark-dev mailing list archives

From Stephen Boesch
Subject Re: MLlib mission and goals
Date Tue, 24 Jan 2017 01:07:10 GMT
Along the lines of #1: Spark Packages seemed to get off to a good start
about two years ago, but now no more than a handful are in general
use, e.g. Databricks CSV.
Browsing the available packages shows that the majority are incomplete, empty,
unmaintained, or unclear.

Any ideas on how to revive Spark Packages so that adoption becomes broad
enough to be meaningful?

2017-01-23 17:03 GMT-08:00 Joseph Bradley:

> This thread is split off from the "Feedback on MLlib roadmap process
> proposal" thread for discussing the high-level mission and goals for
> MLlib.  I hope this thread will collect feedback and ideas, not necessarily
> lead to huge decisions.
> Copying from the previous thread:
> *Seth:*
> """
> I would love to hear some discussion on the higher level goal of Spark
> MLlib (if this derails the original discussion, please let me know and we
> can discuss in another thread). The roadmap does contain specific items
> that help to convey some of this (ML parity with MLlib, model persistence,
> etc...), but I'm interested in what the "mission" of Spark MLlib is. We
> often see PRs for brand new algorithms which are sometimes rejected and
> sometimes not. Do we aim to keep implementing more and more algorithms? Or
> is our focus really, now that we have a reasonable library of algorithms,
> to simply make the existing ones faster/better/more robust? Should we aim
> to make interfaces that are easily extended for developers to easily
> implement their own custom code (e.g. custom optimization libraries), or do
> we want to restrict things to out-of-the box algorithms? Should we focus on
> more flexible, general abstractions like distributed linear algebra?
> I was not involved in the project in the early days of MLlib when this
> discussion may have happened, but I think it would be useful to either
> revisit it or restate it here for some of the newer developers.
> """
> *Mingjie:*
> """
> +1 general abstractions like distributed linear algebra.
> """
> I'll add my thoughts, starting with our past *trajectory*:
> * Initially, MLlib was mainly trying to build a set of core algorithms.
> * Two years ago, the big effort was adding Pipelines.
> * In the last year, big efforts have been around completing Pipelines and
> making the library more robust.
> I agree with Seth that a few *immediate goals* are very clear:
> * feature parity for DataFrame-based API
> * completing and improving testing for model persistence
> * Python, R parity
> *In the future*, it's harder to say, but if I had to pick my top 2 items,
> I'd list:
> *(1) Making MLlib more extensible*
> It will not be feasible to support a huge number of algorithms, so
> allowing users to customize their ML on Spark workflows will be critical.
> This is IMO the most important thing we could do for MLlib.
> Part of this could be building a healthy community of Spark Packages, and
> we will need to make it easier for users to write their own algorithms and
> packages to facilitate this.  Part of this could be allowing users to
> customize existing algorithms with custom loss functions, etc.
> *(2) Consistent improvements to core algorithms*
> A less exciting but still very important item will be constantly improving
> the core set of algorithms in MLlib. This could mean speed, scaling,
> robustness, and usability for the few algorithms which cover 90% of use
> cases.
> There are plenty of other possibilities, and it will be great to hear the
> community's thoughts!
> Thanks,
> Joseph
> --
> Joseph Bradley
> Software Engineer - Machine Learning
> Databricks, Inc.
