Date: Sat, 24 Oct 2015 11:31:45 -0700
Subject: Re: [DISCUSS] SystemML Incubator Proposal
From: Henry Saputra
To: "general@incubator.apache.org"

I have one question about the proposal: it keeps mentioning that it could
run on "Hadoop or Spark", but technically Spark can run on Hadoop YARN.
Was it trying to say it could be run on Hadoop YARN (maybe via MapReduce)
or Spark?

I would love to see whether the execution abstraction is defined well
enough to run on other distributed frameworks like Flink or Tez (maybe
via Crunch?).

Thanks,

Henry

On Fri, Oct 23, 2015 at 4:34 PM, Luciano Resende wrote:
> We would like to start a discussion on accepting SystemML as an Apache
> Incubator project.
>
> The proposal is available at:
> https://wiki.apache.org/incubator/SystemM
>
> Its contents are also copied below.
>
> Thanks in advance for your time reviewing and providing feedback.
>
> ==============
>
> = SystemML =
>
> == Abstract ==
>
> SystemML provides declarative large-scale machine learning (ML) that aims
> at flexible specification of ML algorithms and automatic generation of
> hybrid runtime plans ranging from single-node, in-memory computations to
> distributed computations on Apache Hadoop and Apache Spark. ML algorithms
> are expressed in an R-like syntax that includes linear algebra primitives,
> statistical functions, and ML-specific constructs. This high-level language
> significantly increases the productivity of data scientists as it provides
> (1) full flexibility in expressing custom analytics, and (2) data
> independence from the underlying input formats and physical data
> representations. Automatic optimization according to data characteristics
> such as distribution on the disk file system and sparsity, as well as
> processing characteristics in the distributed environment like number of
> nodes, CPU, and memory per node, ensures both efficiency and scalability.
>
> == Proposal ==
>
> The goal of SystemML is to create a commercially friendly, scalable and
> extensible machine learning framework for data scientists to create or
> extend machine learning algorithms using a declarative syntax. The machine
> learning framework enables data scientists to develop algorithms locally
> without the need of a distributed cluster, and scale up and scale out the
> execution of these algorithms to distributed Hadoop or Spark clusters.
>
> == Background ==
>
> SystemML started as a research project at the IBM Almaden Research Center
> around 2010 aiming to enable data scientists to develop machine learning
> algorithms independent of data and cluster characteristics.
>
> == Rationale ==
>
> SystemML enables the specification of machine learning algorithms using a
> declarative machine learning (DML) language.
> DML includes linear algebra
> primitives, statistical functions, and additional constructs. This
> high-level language significantly increases the productivity of data
> scientists as it provides (1) full flexibility in expressing custom
> analytics and (2) data independence from the underlying input formats and
> physical data representations.
>
> SystemML computations can be executed in a variety of different modes. It
> supports single-node in-memory computations and large-scale distributed
> cluster computations. This allows the user to quickly prototype new
> algorithms in local environments but automatically scale to large data
> sizes as well without changing the algorithm implementation.
>
> Algorithms specified in DML are dynamically compiled and optimized based on
> data and cluster characteristics using rule-based and cost-based
> optimization techniques. The optimizer automatically generates hybrid
> runtime execution plans ranging from in-memory single-node execution to
> distributed computations on Spark or Hadoop. This ensures both efficiency
> and scalability. Automatic optimization reduces or eliminates the need to
> hand-tune distributed runtime execution plans and system configurations.
>
> == Initial Goals ==
>
> The initial goal of moving SystemML to the Apache Incubator is to broaden
> the community and foster contributions from data scientists to develop new
> machine learning algorithms and enhance the existing ones. Ultimately, this
> may lead to the creation of an industry standard for specifying machine
> learning algorithms.
>
> == Current Status ==
>
> The initial code has been developed at the IBM Almaden Research Center in
> California and has recently been made available on GitHub under the Apache
> Software License 2.0. The project currently supports a single node (in
> memory computation) as well as distributed computations utilizing Hadoop or
> Spark clusters.
>
> === Meritocracy ===
>
> We plan to invest in supporting a meritocracy. We will discuss the
> requirements in an open forum. Several companies have already expressed
> interest in this project, and we intend to invite additional developers to
> participate. We will encourage and monitor community participation so that
> privileges can be extended to those who contribute, in keeping with the
> standard of meritocracy that Apache emphasizes.
>
> === Community ===
>
> The need for a generic, scalable and declarative machine learning approach
> in open source is tremendous, so there is a potential for a very large
> community. We believe that SystemML’s extensible architecture, declarative
> syntax, cost-based optimizer and its alignment with Spark will further
> encourage community participation not only in enhancing the infrastructure
> but also in speeding up the creation of algorithms for a wide range of use
> cases. We expect that over time SystemML will attract a large community.
>
> === Alignment ===
>
> The initial committers strongly believe that a generic, scalable and
> declarative approach to machine learning will gain
> broader adoption as an open source, community-driven project, where the
> community can contribute not only to the core components, but also to a
> growing collection of algorithms which will leverage the optimizations and
> ease of scaling in SystemML. Our hope is that the Apache Spark, Apache
> Hadoop and other communities will find tremendous value in SystemML and
> this will foster further collaboration between these projects, furthering
> the already existing integration points.
>
> == Known Risks ==
>
> To date, development has been sponsored by IBM and coordinated mostly by
> the core team of researchers at the IBM Almaden Research Center.
>
> For SystemML to fully transition to an "Apache Way" governance model, it
> needs to start embracing the meritocracy-centric way of growing the
> community of contributors.
>
> === Orphaned Products ===
>
> The SystemML developers and previous sponsor have a long-term interest in
> use and maintenance of the code and there is also hope that growing a
> diverse community around the project will become a guarantee against the
> project becoming orphaned. We feel that it is also important to put formal
> governance in place both for the project and the contributors as the
> project expands. We feel ASF is the best location for this.
>
> === Inexperience with Open Source ===
>
> The current set of SystemML contributors is very diverse regarding
> participation in Open Source. While some initial members are experiencing
> an open source project for the first time, others have been contributing
> to and mentoring various Apache and non-Apache open source projects.
>
> === Reliance on Salaried Developers ===
>
> SystemML currently receives substantial support from salaried developers.
> However, they are all passionate about the project, and we are confident
> that the project will continue even if no salaried developers contribute to
> the project. We are committed to recruiting additional committers including
> non-salaried developers.
>
> === Relationships with Other Apache Products ===
>
> Currently, SystemML integrates with Apache Hadoop and Apache Spark as
> underlying computational distributed runtimes.
>
> === An Excessive Fascination with the Apache Brand ===
>
> SystemML solves a real need for a generic, scalable and declarative
> approach to machine learning in the Apache Hadoop and Spark
> ecosystems, something that has been addressed in a very ad hoc manner so
> far by multiple Apache projects.
> Our rationale for developing SystemML as
> an Apache project is detailed in the Rationale section. We believe that the
> Apache brand and community process will help us attract more contributors
> to this project, and help establish ubiquitous APIs.
>
> == Documentation ==
>
> Documentation regarding SystemML is available in the current GitHub
> repository https://github.com/SparkTC/systemml/tree/master/system-ml/docs.
>
> == Initial Source ==
>
> Initial source is available on GitHub under the Apache License 2.0
>
> https://github.com/SparkTC/systemml
>
> == Source and Intellectual Property Submission Plan ==
>
> We know of no legal encumbrances in the transfer of source code and rights
> to Apache. In fact, given the internal IBM due diligence performed on the
> source code during open sourcing, we expect the code base to be free from
> any IP issues.
>
> == External Dependencies ==
>
> SystemML is written in Java and currently supports Apache Hadoop MapReduce
> and Apache Spark runtimes.
>
> To the best of our knowledge, all dependencies of SystemML are distributed
> under Apache compatible licenses. Upon acceptance to the incubator, we
> would begin a thorough analysis of all transitive dependencies to verify
> this fact and introduce license checking into the build and release process
> (for instance integrating Apache Rat).
>
> Cryptography
> N/A
>
> == Required Resources ==
>
> === Mailing lists ===
> * private@sysml.incubator.apache.org (moderated subscriptions)
> * commits@sysml.incubator.apache.org
> * dev@sysml.incubator.apache.org
>
> === Git Repository ===
> * https://git-wip-us.apache.org/repos/asf/incubator-sysml.git
>
> === Issue Tracking ===
> * JIRA (SYSML)
>
> == Initial Committers ==
>
> * Luciano Resende (lresende AT apache DOT org)
> * Berthold Reinwald (reinwald AT us DOT ibm DOT com)
> * Matthias Boehm (mboehm AT us DOT ibm DOT com)
> * Shirish Tatikonda (statiko AT us DOT ibm DOT com)
> * Niketan Pansare (npansar AT us DOT ibm DOT com)
> * Prithviraj Sen (senp AT us DOT ibm DOT com)
> * Alexandre V Evfimievski (evfimi AT us DOT ibm DOT com)
> * Fred Reiss (frreiss AT us DOT ibm DOT com)
> * Deron Eriksson (deron AT us DOT ibm DOT com)
> * Arvind Surve (asurve AT us DOT ibm DOT com)
> * Mike Dusenberry (mwdusenb AT us DOT ibm DOT com)
> * Reynold Xin (rxin AT apache DOT org)
> * Xiangrui Meng (meng AT apache DOT org)
> * Joseph Bradley (jkbradley AT apache DOT org)
> * Patrick Wendell (pwendell AT apache DOT org)
> * Holden Karau (holden AT apache DOT org)
> * DB Tsai (dbtsai AT apache DOT org)
>
> == Affiliations ==
>
> * DataBricks: Reynold Xin, Xiangrui Meng, Joseph Bradley, Patrick Wendell
> * Alpine: Holden Karau
> * Netflix: DB Tsai
> * IBM: Luciano Resende, Berthold Reinwald, Matthias Boehm, Shirish
> Tatikonda, Niketan Pansare, Prithviraj Sen, Alexandre V Evfimievski, Fred
> Reiss, Deron Eriksson, Arvind Surve and Mike Dusenberry.
>
> == Sponsors ==
>
> === Champion ===
> * Luciano Resende
>
> === Nominated Mentors ===
> * Luciano Resende
> * Reynold Xin
> * Patrick Wendell
> * Rich Bowen
>
> === Sponsoring Entity ===
> We would like to propose that the Apache Incubator sponsor this project.
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/