Mailing-List: contact general-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@incubator.apache.org
Received-SPF: pass (nike.apache.org: domain of hitesh@hortonworks.com
 designates 209.85.220.50 as permitted sender)
Content-Type: text/plain; charset=iso-8859-1
Mime-Version: 1.0 (Apple Message framework v1085)
Subject: Re: [VOTE] Apache Spark for the Incubator
From: Hitesh Shah <hitesh@hortonworks.com>
In-Reply-To: <CDD80F64.D5F9D%chris.a.mattmann@jpl.nasa.gov>
Date: Sat, 8 Jun 2013 00:25:49 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <11517A04-1F90-43C2-83E0-7FD2EBB0A74B@hortonworks.com>
References: <CDD80F64.D5F9D%chris.a.mattmann@jpl.nasa.gov>
To: general@incubator.apache.org

+1 (non-binding)

-- Hitesh

On Jun 7, 2013, at 10:34 PM, Mattmann, Chris A (398J) wrote:

> Hi Folks,
>
> OK discussion has died down, time to VOTE to accept Spark into the
> Apache Incubator. I'll let the VOTE run for at least a week.
>
> So far I've heard +1s from the following folks, so no need for them
> to VOTE again unless they want to change their VOTE:
>
> +1
>
> Chris Mattmann*
> Konstantin Boudnik
> Henry Saputra*
> Reynold Xin
> Pei Chen
> Roman Shaposhnik*
> Suresh Marru*
>
> * -indicates IPMC
>
> [ ] +1 Accept Spark into the Apache Incubator.
> [ ] +0 Don't care.
> [ ] -1 Don't accept Spark into the Apache Incubator because..
>
> Proposal text is below.
>
> === Abstract ===
> Spark is an open source system for large-scale data analysis on clusters.
>
> === Proposal ===
> Spark is an open source system for fast and flexible large-scale data
> analysis. Spark provides a general purpose runtime that supports
> low-latency execution in several forms. These include interactive
> exploration of very large datasets, near real-time stream processing, and
> ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
> with HDFS, HBase, Cassandra and several other storage layers, and
> exposes APIs in Scala, Java and Python.
>
> === Background ===
> Spark started as a U.C. Berkeley research project, designed to efficiently
> run machine learning algorithms on large datasets. Over time, it has
> evolved into a general computing engine as outlined above. Spark's
> developer community has also grown to include additional institutions,
> such as universities, research labs, and corporations. Funding has been
> provided by various institutions including the U.S. National Science
> Foundation, DARPA, and a number of industry sponsors. See:
> https://amplab.cs.berkeley.edu/sponsors/ for full details.
>
> === Rationale ===
> As the number of contributors to Spark has grown, we have sought a
> long-term home for the project, and we believe the Apache foundation would
> be a great fit. Spark is a natural fit for the Apache foundation: Spark
> already interoperates with several existing Apache projects (HDFS, HBase,
> Hive, Cassandra, Avro and Flume to name a few). The Spark team is familiar
> with the Apache process and subscribes to the Apache mission - the
> team includes multiple Apache committers already. Finally, joining Apache
> will help coordinate the development effort of the growing number of
> organizations which contribute to Spark.
>
> == Initial Goals ==
> The initial goals will most likely be to move the existing codebase to
> Apache and integrate with the Apache development process. Furthermore, we
> plan for incremental development and releases in line with the Apache
> guidelines.
>
> === Current Status ===
> == Meritocracy ==
> The Spark project already operates on meritocratic principles. Today,
> Spark has several developers and has accepted multiple major patches from
> outside of U.C. Berkeley. While this process has remained mostly informal
> (we do not have an official committer list), an implicit organization
> exists in which individuals who contribute major components act as
> maintainers for those modules. If accepted, the Spark project would
> include several of these participants as committers from the outset. We
> will work to identify all committers and PPMC members for the project and
> to operate under the ASF meritocratic principles.
>
> === Community ===
> Acceptance into the Apache foundation would bolster the already strong
> user and developer community around Spark. That community includes dozens
> of contributors from several institutions, a meetup group with several
> hundred members, and an active mailing list composed of hundreds of users.
>
> === Core Developers ===
> The core developers of our project are listed in our contributors and
> initial PPMC below. Though many are at UC Berkeley, there is a
> representative cross sampling of other organizations including Quantifind,
> Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Tagged and Webtrends.
>
>
> === Alignment ===
> Our proposed effort aligns with several ongoing BIGDATA and U.S. National
> priority funding interests including the NSF and its Expeditions program,
> and the DARPA XDATA project. Our industry partners and collaborators are
> well aligned with our code base.
>
> There are also a number of related Apache projects and dependencies, which
> are discussed in the Relationship with Other Apache Products section.
>
> == Known Risks ==
>
> === Orphaned Products ===
> Given the current level of investment in Spark, the risk of the project
> being abandoned is minimal. There are several constituents who are highly
> incentivized to continue development. The U.C. Berkeley AMPLab relies on
> Spark as a platform for a large number of long-term research projects.
> Several companies have built verticalized products which are tightly
> dependent on Spark. Other companies have devoted significant internal
> infrastructure investment to Spark.
>
> === Inexperience with Open Source ===
> Spark has existed as a healthy open source project for several years.
> During that time, Matei and others have curated an open-source community
> successfully, attracting developers from a diverse group of companies
> including Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, and
> Webtrends.
>
> === Homogenous Developers ===
> The initial list of committers includes developers from several
> institutions, including Quantifind, Microsoft, Yahoo!, ClearStory Data,
> Bizo, Intel, and Webtrends.
>
> === Reliance on Salaried Developers ===
> Like most open source projects, Spark receives substantial support from
> salaried developers. A large fraction of Spark development is supported by
> graduate students at U.C. Berkeley in the course of research degrees -
> this is more a "volunteer" relationship, since in most cases students
> contribute vastly more than is necessary to immediately support research.
> In addition, those working from within corporations often devote "after
> hours" or spare time to the project - and these come from several
> organizations. We will work to ensure that the project can continue to be
> stewarded and to move forward independent of salaried developers.
>
>
> === Relationship with Other Apache Products ===
> Spark inter-operates with several existing Apache products by supporting
> them as storage layers: Apache Cassandra, Apache HBase, and Apache Hadoop
> (HDFS). It also uses several Apache components internally including Apache
> Maven and several Apache Commons libraries. Finally, Shark (a higher layer
> framework built on Spark) inter-operates with Apache Hive. We will explore
> the relationship between Spark and Apache Gora, which also provides
> in-memory object storage (Champion Mattmann was the Champion for Apache
> Gora so we expect alignment and cross pollination between our efforts).
>
> Spark offers an alternative computation engine to Apache Hadoop
> (MapReduce). Unlike MapReduce, Spark is designed for lower-latency and
> interactive workloads. This makes the projects complementary: many users
> run MapReduce and Spark side-by-side.
>
> === An Excessive Fascination with the Apache Brand ===
> Spark is already a healthy and relatively well known open source project.
> This proposal is not for the purpose of generating publicity. Rather, the
> primary benefits to joining Apache are those outlined in the Rationale
> section.
>
> === Documentation ===
> The reader will find these websites highly relevant:
> * Spark website: http://spark-project.org/
> * Spark documentation: http://spark-project.org/documentation/
> * Issue tracking: https://spark-project.atlassian.net/
> * Codebase: https://github.com/mesos/spark
> * User group: https://groups.google.com/group/spark-users
>
> == Initial Source ==
> The Spark codebase is currently hosted on Github:
> https://github.com/mesos/spark. This is the exact codebase that we would
> migrate to the Apache foundation.
>
> == Source and Intellectual Property Submission Plan ==
> Currently, the Spark codebase is distributed under a BSD license. The vast
> majority of code has copyright held by the University of California. Upon
> entering Apache, Spark will migrate to an Apache License with all
> copyright assigned to the Apache Foundation. The University of California
> will transfer all copyright to the Apache Foundation. In certain cases
> where individuals hold copyright, we will have individuals sign over
> copyright to the Apache foundation as well.
>
> Going forward, all commits would assign copyright directly to the Apache
> foundation through our signed Individual Contributor License Agreements
> for all initial committers on the project.
>
>
> == External Dependencies ==
> To the best of our knowledge, all dependencies of Spark are distributed
> under Apache compatible licenses. Upon acceptance to the incubator, we
> would begin a thorough analysis of all transitive dependencies to verify
> this fact and introduce license checking into the build and release
> process (for instance integrating Apache Rat).
>
> == Required Resources ==
> === Mailing list ===
> We will migrate the existing Spark mailing lists as follows:
>
> * spark-users@googlegroups --> users@spark.incubator.apache.org
> * spark-developers@googlegroups --> dev@spark.incubator.apache.org
> * spark-commits are hosted on Github, so we would request
> commits@spark.incubator.apache.org
>
> The latter is to be consistent with the new PIAO naming scheme for
> podlings.
>
> === Source control ===
> The Spark team would like to use Git for source control, due to our
> current use of Git.
> We request a writeable Git repo for Spark, and mirroring to be set up to
> Github through INFRA. Champion Mattmann can assist with creating INFRA
> tickets for this.
>
> === Issue Tracking ===
> Spark currently uses a hosted JIRA deployment for issue tracking. We will
> migrate to the Apache JIRA.
> http://issues.apache.org/jira/browse/SPARK
>
> == Initial Committers ==
> * Matei Zaharia <matei@apache.org>
> * Ankur Dave <ankurdave@gmail.com>
> * Tathagata Das <tdas@eecs.berkeley.edu>
> * Haoyuan Li <haoyuan@cs.berkeley.edu>
> * Josh Rosen <joshrosen@cs.berkeley.edu>
> * Reynold Xin <rxin@cs.berkeley.edu>
> * Shivaram Venkataraman <shivaram@eecs.berkeley.edu>
> * Mosharaf Chowdhury <mosharaf@cs.berkeley.edu>
> * Charles Reiss <charles@eecs.berkeley.edu>
> * Andy Konwinski <andykonwinski@gmail.com>
> * Patrick Wendell <pwendell@eecs.berkeley.edu>
> * Imran Rashid <imran@quantifind.com>
> * Ryan LeCompte <lecompte@gmail.com>
> * Ravi Pandya <ravip@exchange.microsoft.com>
> * Ram Sriharsha <harshars@yahoo-inc.com>
> * Robert Evans <evans@yahoo-inc.com>
> * Mridul Muralidharan <mridulm@yahoo-inc.com>
> * Thomas Dudziak <tomdz@clearstorydata.com>
> * Mark Hamstra <mark@clearstorydata.com>
> * Stephen Haberman <stephen.haberman@gmail.com>
> * Jason Dai <jason.dai@intel.com>
> * Shane Huang <shannie.huang@gmail.com>
> * Andrew Xia <xiajunluan@gmail.com>
> * Nick Pentreath <nick.pentreath@gmail.com>
> * Sean McNamara <sean.mcnamara@webtrends.com>
>
> == Affiliations ==
> The initial committers are from nine organizations: UC Berkeley,
> Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Mxit and
> Webtrends.
>
> * Matei Zaharia (UCB)
> * Ankur Dave (UCB)
> * Tathagata Das (UCB)
> * Haoyuan Li (UCB)
> * Josh Rosen (UCB)
> * Reynold Xin (UCB)
> * Shivaram Venkataraman (UCB)
> * Mosharaf Chowdhury (UCB)
> * Charles Reiss (UCB)
> * Andy Konwinski (UCB)
> * Patrick Wendell (UCB)
> * Imran Rashid (Quantifind)
> * Ryan LeCompte (Quantifind)
> * Ravi Pandya (Microsoft)
> * Ram Sriharsha (Yahoo!)
> * Robert Evans (Yahoo!)
> * Mridul Muralidharan (Yahoo!)
> * Thomas Dudziak (ClearStory)
> * Mark Hamstra (ClearStory)
> * Stephen Haberman (Bizo)
> * Jason Dai (Intel)
> * Shane Huang (Intel)
> * Andrew Xia (Intel)
> * Nick Pentreath (Mxit)
> * Sean McNamara (Webtrends)
>
> == Sponsors ==
> === Champion ===
> * Chris Mattmann
>
> === Nominated Mentors ===
> * Chris Mattmann
> * Paul Ramirez
> * Andrew Hart
> * Thomas Dudziak
> * Suresh Marru
> * Henry Saputra
>
> === Sponsoring Entity ===
> The Apache Incubator
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

