spark-dev mailing list archives

From Jörn Franke <>
Subject Re: MLlib mission and goals
Date Tue, 24 Jan 2017 11:03:39 GMT
I also agree with Joseph and Sean.
With respect to spark-packages: I think the issue is that you have to add packages manually, even though the mechanism basically fetches them from Maven Central (or a custom upload).

From an organizational perspective there are other issues. For example, packages are downloaded from the Internet instead of from an artifact repository within the enterprise. You do not want users to download arbitrary packages from the Internet onto a production cluster. You also want to make sure that they do not use outdated or snapshot versions, and that you have control over dependencies, licenses, etc.
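For context on the points above: a spark-package is typically pulled in through spark-submit's dependency flags, which resolve Maven coordinates against Maven Central (plus the spark-packages repository) by default, and can be pointed at an internal mirror instead with `--repositories`. A sketch; the coordinates and URLs below are illustrative, not a real package:

```shell
# Resolve a package by its Maven coordinates (groupId:artifactId:version).
# By default this resolves against Maven Central and the spark-packages repo.
spark-submit \
  --packages com.example:some-ml-package_2.11:1.0.0 \
  my_job.py

# Point resolution at an internal artifact repository instead, so nothing
# is fetched directly from the public Internet onto the cluster.
spark-submit \
  --packages com.example:some-ml-package_2.11:1.0.0 \
  --repositories https://artifacts.example.com/maven \
  my_job.py
```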

Currently I do not see the major artifact repository managers supporting spark-packages anytime soon, nor do I see it coming from the big Hadoop distributions.

> On 24 Jan 2017, at 11:37, Sean Owen <> wrote:
> My $0.02, which shouldn't be weighted too much.
> I believe the mission of Spark ML has been to provide the framework, and then an implementation of 'the basics' only. It should have the tools that cover ~80% of use cases, out of the box, in a pretty well-supported and tested way.
> It's not a goal to support an arbitrarily large collection of algorithms, because each one adds marginally less value and, IMHO, proportionally more baggage: the contributors tend to skew academic, produce worse code, and don't stick around to maintain it.
> The project is already generally quite overloaded; I don't know if there's bandwidth to even cover the current scope. While 'the basics' is a subjective label, de facto, I think we'd have to define it as essentially "what we already have in place" for the foreseeable future.
> That the bits on spark-packages aren't so hot is not a problem but a symptom. Would these really be better in the core project?
> And/or: I entirely agree with Joseph's take.
>> On Tue, Jan 24, 2017 at 1:03 AM Joseph Bradley <> wrote:
>> This thread is split off from the "Feedback on MLlib roadmap process proposal" thread
for discussing the high-level mission and goals for MLlib.  I hope this thread will collect
feedback and ideas, not necessarily lead to huge decisions.
>> Copying from the previous thread:
>> Seth:
>> """
>> I would love to hear some discussion on the higher level goal of Spark MLlib (if
this derails the original discussion, please let me know and we can discuss in another thread).
The roadmap does contain specific items that help to convey some of this (ML parity with MLlib,
model persistence, etc...), but I'm interested in what the "mission" of Spark MLlib is. We
often see PRs for brand new algorithms which are sometimes rejected and sometimes not. Do
we aim to keep implementing more and more algorithms? Or is our focus really, now that we
have a reasonable library of algorithms, to simply make the existing ones faster/better/more
robust? Should we aim to make interfaces that are easily extended for developers to easily
implement their own custom code (e.g. custom optimization libraries), or do we want to restrict
things to out-of-the box algorithms? Should we focus on more flexible, general abstractions
like distributed linear algebra?
>> I was not involved in the project in the early days of MLlib when this discussion
may have happened, but I think it would be useful to either revisit it or restate it here
for some of the newer developers.
>> """
>> Mingjie:
>> """
>> +1 general abstractions like distributed linear algebra.
>> """
>> I'll add my thoughts, starting with our past trajectory:
>> * Initially, MLlib was mainly trying to build a set of core algorithms.
>> * Two years ago, the big effort was adding Pipelines.
>> * In the last year, big efforts have been around completing Pipelines and making
the library more robust.
>> I agree with Seth that a few immediate goals are very clear:
>> * feature parity for DataFrame-based API
>> * completing and improving testing for model persistence
>> * Python, R parity
>> In the future, it's harder to say, but if I had to pick my top 2 items, I'd list:
>> (1) Making MLlib more extensible
>> It will not be feasible to support a huge number of algorithms, so allowing users to customize their ML-on-Spark workflows will be critical.  This is IMO the most important thing we could do for MLlib.
>> Part of this could be building a healthy community of Spark Packages, and we will
need to make it easier for users to write their own algorithms and packages to facilitate
this.  Part of this could be allowing users to customize existing algorithms with custom loss
functions, etc.
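One concrete flavor of the extensibility described above already exists: the Pipelines API lets third-party code plug in by subclassing the transformer abstractions, with no changes to Spark itself. A minimal sketch, assuming Spark 2.x on the classpath; the class and scaling logic are made up for illustration:

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, DoubleType}

// A hypothetical user-defined pipeline stage that rescales a numeric column.
// Subclassing UnaryTransformer is enough to make it composable with the
// built-in stages in a Pipeline.
class Rescaler(override val uid: String)
    extends UnaryTransformer[Double, Double, Rescaler] {

  def this() = this(Identifiable.randomUID("rescaler"))

  // The per-row function applied to the input column.
  override protected def createTransformFunc: Double => Double = _ * 2.0

  // Type of the generated output column.
  override protected def outputDataType: DataType = DoubleType
}
```

A stage like this can then be dropped into `new Pipeline().setStages(...)` alongside built-in estimators; richer hooks (custom loss functions, persistence helpers for third-party stages) would go beyond what this extension point offers today.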
>> (2) Consistent improvements to core algorithms
>> A less exciting but still very important item will be constantly improving the core
set of algorithms in MLlib. This could mean speed, scaling, robustness, and usability for
the few algorithms which cover 90% of use cases.
>> There are plenty of other possibilities, and it will be great to hear the community's thoughts.
>> Thanks,
>> Joseph
>> -- 
>> Joseph Bradley
>> Software Engineer - Machine Learning
>> Databricks, Inc.
