incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Till Westmann" <ti...@apache.org>
Subject Re: [VOTE] Accept SystemML into Apache Incubator
Date Wed, 28 Oct 2015 15:39:01 GMT
+1

On 27 Oct 2015, at 21:52, Luciano Resende wrote:

> After initial discussion, please vote on the acceptance of SystemML
> Project for incubation at the Apache Incubator. The full proposal is
> available at the end of this message and on the wiki at :
>
> https://wiki.apache.org/incubator/SystemML
> <http://wiki.apache.org/incubator/Nuvem>
>
> Please cast your votes:
>
> [ ] +1, bring SystemML into Incubator
> [ ] +0, I don't care either way
> [ ] -1, do not bring SystemML into Incubator, because...
>
> The vote is open for the next 72 hours and only votes from the
> Incubator PMC are binding.
>
>
> = SystemML =
>
> == Abstract ==
>
> SystemML provides declarative large-scale machine learning (ML) that 
> aims
> at flexible specification of ML algorithms and automatic generation of
> hybrid runtime plans ranging from single node, in-memory computations, 
> to
> distributed computations on Apache Hadoop MapReduce and  Apache Spark. 
> ML
> algorithms are expressed in an R-like syntax, that includes linear 
> algebra
> primitives, statistical functions, and ML-specific constructs. This
> high-level language significantly increases the productivity of data
> scientists as it provides (1) full flexibility in expressing custom
> analytics, and (2) data independence from the underlying input formats 
> and
> physical data representations. Automatic optimization according to 
> data
> characteristics such as distribution on the disk file system, and 
> sparsity
> as well as processing characteristics in the distributed environment 
> like
> number of nodes, CPU, memory per node, ensures both efficiency and
> scalability.
>
> == Proposal ==
>
> The goal of SystemML is to create a commercial friendly, scalable and
> extensible machine learning framework for data scientists to create or
> extend machine learning algorithms using a declarative syntax. The 
> machine
> learning framework enables data scientists to develop algorithms 
> locally
> without the need of a distributed cluster, and scale up and scale out 
> the
> execution of these algorithms to distributed Apache Hadoop MapReduce 
> or
> Apache Spark clusters.
>
> == Background ==
>
> SystemML started as a research project in the IBM Almaden Research 
> Center
> around 2007 aiming to enable data scientists to develop machine 
> learning
> algorithms independent of data and cluster characteristics.
>
> == Rationale ==
>
> SystemML enables the specification of machine learning algorithms 
> using a
> declarative machine learning (DML) language. DML includes linear 
> algebra
> primitives, statistical functions, and additional constructs. This
> high-level language significantly increases the productivity of data
> scientists as it provides (1) full flexibility in expressing custom
> analytics and (2) data independence from the underlying input formats 
> and
> physical data representations.
>
> SystemML computations can be executed in a variety of different modes. 
> It
> supports single node in-memory computations and large-scale 
> distributed
> cluster computations. This allows the user to quickly prototype new
> algorithms in local environments but automatically scale to large data
> sizes as well without changing the algorithm implementation.
>
> Algorithms specified in DML are dynamically compiled and optimized 
> based on
> data and cluster characteristics using rule-based and cost-based
> optimization techniques. The optimizer automatically generates hybrid
> runtime execution plans ranging from in-memory single-node execution 
> to
> distributed computations on Apache Spark or Apache Hadoop MapReduce. 
> This
> ensures both efficiency and scalability. Automatic optimization 
> reduces or
> eliminates the need to hand-tune distributed runtime execution plans 
> and
> system configurations.
>
> == Initial Goals ==
>
> The initial goals to move SystemML to the Apache Incubator is to 
> broaden
> the community foster the contributions from data scientists to develop 
> new
> machine learning algorithms and enhance the existing ones. Ultimately, 
> this
> may lead to the creation of an industry standard in specifying machine
> learning algorithms.
>
> == Current Status ==
>
> The initial code has been developed at the IBM Almaden Research Center 
> in
> California and has recently been made available in GitHub under the 
> Apache
> Software License 2.0. The project currently supports a single node (in
> memory computation) as well as distributed computations utilizing 
> Apache
> Hadoop MapReduce or Apache Spark clusters.
>
> === Meritocracy ===
>
> We plan to invest in supporting a meritocracy. We will discuss the
> requirements in an open forum. Several companies have already 
> expressed
> interest in this project, and we intend to invite additional 
> developers to
> participate. We will encourage and monitor community participation so 
> that
> privileges can be extended to those that contribute operating to the
> standard of meritocracy that Apache emphasizes.
>
> === Community ===
>
> The need for a generic scalable and declarative machine learning 
> approach
> in the open source is tremendous, so there is a potential for a very 
> large
> community. We believe that SystemML’s extensible architecture, 
> declarative
> syntax, cost based optimizer and its alignment with Spark will further
> encourage community participation not only in enhancing the 
> infrastructure
> but also speed up the creation of algorithms for a wide range of use
> cases.  We expect that over time SystemML will attract a large 
> community.
>
> === Alignment ===
>
> The initial committers strongly believe that a generic scalable and
> declarative machine learning approach for machine learning will gain
> broader adoption as an open source, community driven project, where 
> the
> community can contribute not only to the core components, but also to 
> a
> growing collection of algorithms which will leverage the optimizations 
> and
> ease of scaling in SystemML. Our hope is that the Apache Spark, Apache
> Hadoop and other communities will find tremendous value in SystemML 
> and
> this will foster further collaboration between these projects 
> furthering
> the already existing integration points.
>
> == Known Risks ==
>
> To-date, development has been sponsored by IBM and coordinated mostly 
> by
> the core team of researchers at the IBM Almaden Research Center.
>
> For SystemML to fully transition to an "Apache Way" governance model, 
> it
> needs to start embracing the meritocracy-centric way of growing the
> community of contributors.
>
> === Orphaned Products ===
>
> The SystemML developers and previous sponsor have a long-term interest 
> in
> use and maintenance of the code and there is also hope that growing a
> diverse community around the project will become a guarantee against 
> the
> project becoming orphaned. We feel that it is also important to put 
> formal
> governance in place both for the project and the contributors as the
> project expands. We feel ASF is the best location for this.
>
> === Inexperience with Open Source ===
>
> The current SystemML set of contributors are very diverse regarding
> participation in Open Source. While some initial members are 
> experiencing
> an open source project for the first time, others have been 
> contributing
> and mentoring various Apache and non-Apache open source projects.
>
> === Reliance on Salaried Developers ===
>
> SystemML currently receives substantial support from salaried 
> developers.
> However, they are all passionate about the project, and we are 
> confident
> that the project will continue even if no salaried developers 
> contribute to
> the project. We are committed to recruiting additional committers 
> including
> non-salaried developers.
>
>
> === Relationships with Other Apache Products ===
>
> Currently, SystemML integrates with Apache Hadoop MapReduce and Apache
> Spark as underlying computational distributed runtimes.
>
> === An Excessive Fascination with the Apache Brand ===
>
> SystemML solves a real need for generic scalable and declarative 
> machine
> learning approach for machine learning in the Apache Hadoop and Spark
> ecosystems, something that has been addressed in a very ad hoc manner 
> so
> far by multiple Apache projects. Our rationale for developing SystemML 
> as
> an Apache project is detailed in the Rationale section. We believe 
> that the
> Apache brand and community process will help us attract more 
> contributors
> to this project, and help establish ubiquitous APIs.
>
>
> == Documentation ==
>
> Documentation regarding SystemML is available in the current GitHub
> repository 
> https://github.com/SparkTC/systemml/tree/master/system-ml/docs.
>
> == Initial Source ==
>
> Initial source is available on GitHub under the Apache License 2.0
>
> https://github.com/SparkTC/systemml
>
> == Source and Intellectual Property Submission Plan ==
>
> We know of no legal encumbrances in the transfer of source code and 
> rights
> to Apache. In fact, given the internal IBM due diligence performed on 
> the
> source code during open sourcing, we expect the code base to be free 
> from
> any IP issues.
>
> == External Dependencies ==
>
> SystemML is written in Java and currently supports Apache Hadoop 
> MapReduce
> and Apache Spark runtimes.
>
> To the best of our knowledge, all dependencies of SystemML are 
> distributed
> under Apache compatible licenses. Upon acceptance to the incubator, we
> would begin a thorough analysis of all transitive dependencies to 
> verify
> this fact and introduce license checking into the build and release 
> process
> (for instance integrating Apache Rat).
>
> Cryptography
> N/A
>
> == Required Resources ==
>
> === Mailing lists ===
>    * private@sysml.incubator.apache.org (moderated subscriptions)
>    * commits@sysml.incubator.apache.org
>    * dev@sysml.incubator.apache.org
>
> === Git Repository ===
>    * https://git-wip-us.apache.org/repos/asf/incubator-sysml.git
>
> === Issue Tracking ===
>    * JIRA (SYSML)
>
> == Initial Committers ==
>
> * Luciano Resende (lresende AT apache DOT org)
> * Berthold Reinwald (reinwald AT us DOT ibm DOT com)
> * Matthias Boehm (mboehm AT us DOT ibm DOT com)
> * Shirish Tatikonda (statiko AT us DOT ibm DOT com)
> * Niketan Pansare (npansar AT us DOT ibm DOT com)
> * Prithviraj Sen (senp AT us DOT ibm DOT com)
> * Alexandre V Evfimievski (evfimi AT us DOT ibm DOT com)
> * Fred Reiss (frreiss AT us DOT ibm DOT com)
> * Deron Eriksson (deron AT us DOT ibm DOT com)
> * Arvind Surve (asurve AT us DOT ibm DOT com)
> * Mike Dusenberry (mwdusenb AT us DOT ibm DOT com)
> * Reynold Xin   (rxin AT apache DOT org)
> * Xiangrui Meng (meng AT apache DOT org)
> * Joseph Bradley (jkbradley AT apache DOT org)
> * Patrick Wendell (pwendell AT apache DOT org)
> * Holden Karau (holden AT apache DOT org)
> * DB Tsai (dbtsai AT apache DOT org)
>
> == Affiliations ==
>
> * DataBricks: Reynold Xin, Xiangrui Meng, Joseph Bradley, Patrick 
> Wendell
> * Netflix: DB Tsai
> * IBM: Luciano Resende, Berthold Reinwald, Matthias Boehm, Shirish
> Tatikonda, Niketan Pansare, Prithviraj Sen, Alexandre V Evfimievski, 
> Fred
> Reiss, Deron Eriksson, Arvind Surve, Mike Dusenberry and Holden Karau.
>
> == Sponsors ==
>
> === Champion ===
> * Luciano Resende
>
> === Nominated Mentors ===
> * Luciano Resende
> * Reynold Xin
> * Patrick Wendell
> * Rich Bowen
>
> === Sponsoring Entity ===
> We would like to propose the Apache Incubator to sponsor this project.
>
>
> -- 
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Mime
View raw message