Mailing-List: contact general-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@incubator.apache.org
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.2 \(3112\))
Subject: Re: [VOTE] Accept Beam into the Apache Incubator
From: Suresh Marru <smarru@apache.org>
In-Reply-To: <56AA2574.2000700@nanthrax.net>
Date: Thu, 28 Jan 2016 11:38:56 -0500
Content-Transfer-Encoding: quoted-printable
Message-Id: <27657328-E970-4A73-914C-0B0F7D11F90B@apache.org>
References: <56AA2574.2000700@nanthrax.net>
To: general@incubator.apache.org

+1 (binding).

Suresh

> On Jan 28, 2016, at 9:28 AM, Jean-Baptiste Onofr=C3=A9 =
<jb@nanthrax.net> wrote:
>=20
> Hi,
>=20
> the Beam proposal (initially Dataflow) was submitted last week.
>=20
> The complete discussion thread is available here:
>=20
> =
http://mail-archives.apache.org/mod_mbox/incubator-general/201601.mbox/%3C=
CA%2B%3DKJmvj4wyosNTXVpnsH8PhS7jEyzkZngc682rGgZ3p28L42Q%40mail.gmail.com%3=
E
>=20
> As a reminder, the Beam proposal is here:
>=20
> https://wiki.apache.org/incubator/BeamProposal
>=20
> Given all the great feedback we received on the mailing list, we =
think it's time to call a vote to accept Beam into the Incubator.
>=20
> Please cast your vote to:
> [] +1 - accept Apache Beam as a new incubating project
> []  0 - not sure
> [] -1 - do not accept the Apache Beam project (because: ...)
>=20
> Thanks,
> Regards
> JB
> ----
> ## page was renamed from DataflowProposal
> =3D Apache Beam =3D
>=20
> =3D=3D Abstract =3D=3D
>=20
> Apache Beam is an open source, unified model and set of =
language-specific SDKs for defining and executing data processing =
workflows, and also data ingestion and integration flows, supporting =
Enterprise Integration Patterns (EIPs) and Domain Specific Languages =
(DSLs). Dataflow pipelines simplify the mechanics of large-scale batch =
and streaming data processing and can run on a number of runtimes like =
Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). =
Beam also brings DSLs in different languages, allowing users to easily =
implement their data integration processes.
>=20
> =3D=3D Proposal =3D=3D
>=20
> Beam is a simple, flexible, and powerful system for distributed data =
processing at any scale. Beam provides a unified programming model, a =
software development kit to define and construct data processing =
pipelines, and runners to execute Beam pipelines in several runtime =
engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Beam =
can be used for a variety of streaming or batch data processing goals =
including ETL, stream analysis, and aggregate computation. The =
underlying programming model for Beam provides MapReduce-like =
parallelism, combined with support for powerful data windowing, and =
fine-grained correctness control.
>=20
> =3D=3D Background =3D=3D
>=20
> Beam started as a set of Google projects (Google Cloud Dataflow) =
focused on making data processing easier, faster, and less costly. The =
Beam model is a successor to MapReduce, FlumeJava, and MillWheel inside =
Google and is focused on providing a unified solution for batch and =
stream processing. These projects on which Beam is based have been =
published in several papers made available to the public:
>=20
> * MapReduce - http://research.google.com/archive/mapreduce.html
> * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
> * FlumeJava - http://research.google.com/pubs/pub35650.html
> * MillWheel - http://research.google.com/pubs/pub41378.html
>=20
> Beam was designed from the start to provide a portable programming =
layer. When you define a data processing pipeline with the Beam model, =
you are creating a job which is capable of being processed by any number =
of Beam processing engines. Several engines have been developed to run =
Beam pipelines in other open source runtimes, including Beam runners =
for Apache Flink and Apache Spark. There is also a =E2=80=9Cdirect =
runner=E2=80=9D, for execution on the developer machine (mainly for =
dev/debug purposes). Another runner allows a Beam program to run on a =
managed service, Google Cloud Dataflow, in Google Cloud Platform. The =
Dataflow Java SDK is already available on GitHub and is independent of =
the Google Cloud Dataflow service. A Python SDK is also currently in =
active development.
>=20
> In this proposal, the Beam SDKs, model, and a set of runners will be =
submitted as an OSS project under the ASF. The runners which are a part =
of this proposal include those for Spark (from Cloudera), Flink (from =
data Artisans), and local development (from Google); the Google Cloud =
Dataflow service runner is not included in this proposal. Further =
references to Beam will refer to the Dataflow model, SDKs, and runners =
which are a part of this proposal (Apache Beam) only. The initial =
submission will contain the already-released Java SDK; Google intends to =
submit the Python SDK later in the incubation process. The Google Cloud =
Dataflow service will continue to be one of many runners for Beam, built =
on Google Cloud Platform, to run Beam pipelines. Necessarily, Cloud =
Dataflow will develop against the Apache project additions, updates, and =
changes. Google Cloud Dataflow will become one user of Apache Beam and =
will participate in the project openly and publicly.
>=20
> The Beam programming model has been designed with simplicity, =
scalability, and speed as key tenets. In the Beam model, you only need =
to think about four top-level concepts when constructing your data =
processing job:
>=20
> * Pipelines - The data processing job made of a series of computations =
including input, processing, and output
> * PCollections - Bounded (or unbounded) datasets which represent the =
input, intermediate and output data in pipelines
> * PTransforms - A data processing step in a pipeline in which one or =
more PCollections are an input and output
> * I/O Sources and Sinks - APIs for reading and writing data which are =
the roots and endpoints of the pipeline
>=20
> =3D=3D Rationale =3D=3D
>=20
> With Google Dataflow, Google intended to develop a framework which =
allowed developers to be maximally productive in defining the =
processing, and then be able to execute the program at various levels of =
latency/cost/completeness without re-architecting or re-writing it. This =
goal was informed by Google=E2=80=99s past experience developing =
several models, frameworks, and tools useful for large-scale and =
distributed data processing. While Google has previously published =
papers describing some of its technologies, Google decided to take a =
different approach with Dataflow. Google open-sourced the SDK and model =
alongside commercialization of the idea and ahead of publishing papers =
on the topic. As a result, a number of open source runtimes exist for =
Dataflow, such as the Apache Flink and Apache Spark runners.
>=20
> We believe that submitting Beam as an Apache project will provide an =
immediate, worthwhile, and substantial contribution to the open source =
community. As an incubating project, we believe Dataflow will have a =
better opportunity to provide a meaningful contribution to OSS and also =
integrate with other Apache projects.
>=20
> In the long term, we believe Beam can be a powerful abstraction layer =
for data processing. By providing an abstraction layer for data =
pipelines and processing, data workflows can be increasingly portable, =
resilient to breaking changes in tooling, and compatible across many =
execution engines, runtimes, and open source projects.
>=20
> =3D=3D Initial Goals =3D=3D
>=20
> We are breaking our initial goals into immediate (< 2 months), =
short-term (2-4 months), and intermediate-term (> 4 months).
>=20
> Our immediate goals include the following:
>=20
> * Plan for reconciling the Dataflow Java SDK and various runners into =
one project
> * Plan for refactoring the existing Java SDK for better extensibility =
by SDK and runner writers
> * Validating all dependencies are ASL 2.0 or compatible
> * Understanding and adapting to the Apache development process
>=20
> Our short-term goals include:
>=20
> * Moving the newly merged lists and build utilities to Apache
> * Start refactoring codebase and move code to Apache Git repo
> * Continue development of new features, functions, and fixes in the =
Dataflow Java SDK, and Dataflow runners
> * Cleaning up the Dataflow SDK sources and crafting a roadmap and plan =
for how to include new major ideas, modules, and runtimes
> * Establishment of an easy and clear build/test framework for Dataflow =
and associated runtimes; creation of testing, rollback, and validation =
policy
> * Analysis and design for work needed to make Beam a better data =
processing abstraction layer for multiple open source frameworks and =
environments
>=20
> Finally, we have a number of intermediate-term goals:
>=20
> * Roadmapping, planning, and execution of integrations with other OSS =
and non-OSS projects/products
> * Inclusion of an additional SDK for Python, which is under active =
development
>=20
> =3D=3D Current Status =3D=3D
>=20
> =3D=3D=3D Meritocracy =3D=3D=3D
>=20
> Dataflow was initially developed based on ideas from many employees =
within Google. As an ASL OSS project on GitHub, the Dataflow SDK has =
received contributions from data Artisans, Cloudera Labs, and other =
individual developers. As a project under incubation, we are committed =
to expanding our effort to build an environment which supports a =
meritocracy. We are focused on engaging the community and other related =
projects for support and contributions. Moreover, we are committed to =
ensuring contributors and committers to Dataflow come from a broad mix of =
organizations through a merit-based decision process during incubation. =
We believe strongly in the Beam model and are committed to growing an =
inclusive community of Beam contributors.
>=20
> =3D=3D=3D Community =3D=3D=3D
>=20
> The core of the Dataflow Java SDK has been developed by Google for use =
with Google Cloud Dataflow. Google has active community engagement in =
the SDK GitHub repository =
(https://github.com/GoogleCloudPlatform/DataflowJavaSDK), on Stack =
Overflow =
(http://stackoverflow.com/questions/tagged/google-cloud-dataflow) and =
has had contributions from a number of organizations and individuals.
>=20
> Every day, Cloud Dataflow is actively used by a number of organizations =
and institutions for batch and stream processing of data. We believe =
acceptance will allow us to consolidate existing Dataflow-related work, =
grow the Dataflow community, and deepen connections between Dataflow and =
other open source projects.
>=20
> =3D=3D=3D Core Developers =3D=3D=3D
>=20
> The core developers for Dataflow and the Dataflow runners are:
>=20
> * Frances Perry
> * Tyler Akidau
> * Davor Bonaci
> * Luke Cwik
> * Ben Chambers
> * Kenn Knowles
> * Dan Halperin
> * Daniel Mills
> * Mark Shields
> * Craig Chambers
> * Maximilian Michels
> * Tom White
> * Josh Wills
> * Robert Bradshaw
>=20
> =3D=3D=3D Alignment =3D=3D=3D
>=20
> The Beam SDK can be used to create Beam pipelines which can be =
executed on Apache Spark or Apache Flink. Beam is also related to other =
Apache projects, such as Apache Crunch. We plan on expanding =
functionality for Beam runners, support for additional domain specific =
languages, and increased portability so Beam is a powerful abstraction =
layer for data processing.
>=20
> =3D=3D Known Risks =3D=3D
>=20
> =3D=3D=3D Orphaned Products =3D=3D=3D
>=20
> The Dataflow SDK is presently used by several organizations, from =
small startups to Fortune 100 companies, to construct production =
pipelines which are executed in Google Cloud Dataflow. Google has a =
long-term commitment to advance the Dataflow SDK; moreover, Dataflow is =
seeing increasing interest, development, and adoption from organizations =
outside of Google.
>=20
> =3D=3D=3D Inexperience with Open Source =3D=3D=3D
>=20
> Google believes strongly in open source and the exchange of =
information to advance new ideas and work. Examples of this commitment =
are active OSS projects such as Chromium (https://www.chromium.org) and =
Kubernetes (http://kubernetes.io/). With Dataflow, we have tried to be =
increasingly open and forward-looking; we have published a paper in the =
VLDB conference describing the Dataflow model =
(http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick to =
release the Dataflow SDK as open source software with the launch of =
Cloud Dataflow. Our submission to the Apache Software Foundation is a =
logical extension of our commitment to open source software.
>=20
> =3D=3D=3D Homogeneous Developers =3D=3D=3D
>=20
> The majority of committers in this proposal belong to Google due to =
the fact that Dataflow has emerged from several internal Google =
projects. This proposal also includes committers outside of Google who =
are actively involved with other Apache projects, such as Hadoop, Flink, =
and Spark. We expect our entry into incubation will allow us to expand =
the number of individuals and organizations participating in Dataflow =
development. Additionally, separation of the Dataflow SDK from Google =
Cloud Dataflow allows us to focus on the open source SDK and model and =
do what is best for this project.
>=20
> =3D=3D=3D Reliance on Salaried Developers =3D=3D=3D
>=20
> The Dataflow SDK and Dataflow runners have been developed primarily by =
salaried developers supporting the Google Cloud Dataflow project. While =
the Dataflow SDK and Cloud Dataflow have been developed by different =
teams (and this proposal would reinforce that separation) we expect our =
initial set of developers will still primarily be salaried. Contribution =
has not been exclusively from salaried developers, however. For example, =
the contrib directory of the Dataflow SDK =
(https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/master/contri=
b) contains items from free-time contributors. Moreover, separate =
projects, such as ScalaFlow (https://github.com/darkjh/scalaflow) have =
been created around the Dataflow model and SDK. We expect our reliance =
on salaried developers will decrease over time during incubation.
>=20
> =3D=3D=3D Relationship with other Apache products =3D=3D=3D
>=20
> Dataflow directly interoperates with or utilizes several existing =
Apache projects.
>=20
> * Build
>  * Apache Maven
> * Data I/O, Libraries
>  * Apache Avro
>  * Apache Commons
> * Dataflow runners
>  * Apache Flink
>  * Apache Spark
>=20
> Beam when used in batch mode shares similarities with Apache Crunch; =
however, Beam is focused on a model, SDK, and abstraction layer beyond =
Spark and Hadoop (MapReduce). One key goal of Beam is to provide an =
intermediate abstraction layer which can easily be implemented and =
utilized across several different processing frameworks.
>=20
> =3D=3D=3D An excessive fascination with the Apache brand =3D=3D=3D
>=20
> With this proposal we are not seeking attention or publicity. Rather, =
we firmly believe in the Beam model, SDK, and the ability to make Beam a =
powerful yet simple framework for data processing. While the Dataflow =
SDK and model have been open source, we believe putting code on GitHub =
can only go so far. We see the Apache community, processes, and mission =
as critical for ensuring the Beam SDK and model are truly =
community-driven, positively impactful, and innovative open source =
software. While Google has taken a number of steps to advance its =
various open source projects, we believe Beam is a great fit for the =
Apache Software Foundation due to its focus on data processing and its =
relationships to existing ASF projects.
>=20
> =3D=3D Documentation =3D=3D
>=20
> The following documentation is relevant to this proposal. Relevant =
portions of the documentation will be contributed to the Apache Beam =
project.
>=20
> * Dataflow website: https://cloud.google.com/dataflow
> * Dataflow programming model: =
https://cloud.google.com/dataflow/model/programming-model
> * Codebases
>  * Dataflow Java SDK: =
https://github.com/GoogleCloudPlatform/DataflowJavaSDK
>  * Flink Dataflow runner: =
https://github.com/dataArtisans/flink-dataflow
>  * Spark Dataflow runner: https://github.com/cloudera/spark-dataflow
> * Dataflow Java SDK issue tracker: =
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues
> * google-cloud-dataflow tag on Stack Overflow: =
http://stackoverflow.com/questions/tagged/google-cloud-dataflow
>=20
> =3D=3D Initial Source =3D=3D
>=20
> The initial source for Beam which we will submit to the Apache =
Software Foundation will include several related projects which are =
currently hosted in the following GitHub repositories:
>=20
> * Dataflow Java SDK =
(https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
> * Flink Dataflow runner =
(https://github.com/dataArtisans/flink-dataflow)
> * Spark Dataflow runner (https://github.com/cloudera/spark-dataflow)
>=20
> These projects have always been Apache 2.0 licensed. We intend to =
bundle all of these repositories since they are all complementary and =
should be maintained in one project. Prior to our submission, we will =
combine all of these projects into a new git repository.
>=20
> =3D=3D Source and Intellectual Property Submission Plan =3D=3D
>=20
> The source for the Dataflow SDK and the three runners (Spark, Flink, =
Google Cloud Dataflow) is already licensed under the Apache 2.0 license.
>=20
> * Dataflow SDK - =
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENSE=

> * Flink runner - =
https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE
> * Spark runner - =
https://github.com/cloudera/spark-dataflow/blob/master/LICENSE
>=20
> Contributors to the Dataflow SDK have also signed the Google =
Individual Contributor License Agreement =
(https://cla.developers.google.com/about/google-individual) in order to =
contribute to the project.
>=20
> With respect to trademark rights, Google does not hold a trademark on =
the phrase =E2=80=9CDataflow.=E2=80=9D Based on feedback and guidance we =
receive during the incubation process, we are open to renaming the =
project if necessary for trademark or other concerns.
>=20
> =3D=3D External Dependencies =3D=3D
>=20
> All external dependencies are licensed under an Apache 2.0 or =
Apache-compatible license. As we grow the Beam community we will =
configure our build process to require and validate that all =
contributions and dependencies are licensed under the Apache 2.0 =
license or an Apache-compatible license.
>=20
> =3D=3D Required Resources =3D=3D
>=20
> =3D=3D=3D Mailing Lists =3D=3D=3D
>=20
> We currently use a mix of mailing lists. We will migrate our existing =
mailing lists to the following:
>=20
> * dev@beam.incubator.apache.org
> * user@beam.incubator.apache.org
> * private@beam.incubator.apache.org
> * commits@beam.incubator.apache.org
>=20
> =3D=3D=3D Source Control =3D=3D=3D
>=20
> The Dataflow team currently uses Git and would like to continue to do =
so. We request a Git repository for Beam with mirroring to GitHub =
enabled.
>=20
> * https://git-wip-us.apache.org/repos/asf/incubator-beam.git
>=20
> =3D=3D=3D Issue Tracking =3D=3D=3D
>=20
> We request the creation of an Apache-hosted JIRA. The Dataflow project =
is currently using both a public GitHub issue tracker and internal =
Google issue tracking. We will migrate and combine issues from these two =
sources into the Apache JIRA.
>=20
> * Jira ID: BEAM
>=20
> =3D=3D Initial Committers =3D=3D
>=20
> * Aljoscha Krettek        [aljoscha@apache.org]
> * Amit Sela               [amitsela33@gmail.com]
> * Ben Chambers            [bchambers@google.com]
> * Craig Chambers          [chambers@google.com]
> * Dan Halperin            [dhalperi@google.com]
> * Davor Bonaci            [davor@google.com]
> * Frances Perry           [fjp@google.com]
> * James Malone            [jamesmalone@google.com]
> * Jean-Baptiste Onofr=C3=A9    [jbonofre@apache.org]
> * Josh Wills              [jwills@apache.org]
> * Kostas Tzoumas          [kostas@data-artisans.com]
> * Kenneth Knowles         [klk@google.com]
> * Luke Cwik               [lcwik@google.com]
> * Maximilian Michels      [mxm@apache.org]
> * Stephan Ewen            [stephan@data-artisans.com]
> * Tom White               [tom@cloudera.com]
> * Tyler Akidau            [takidau@google.com]
> * Robert Bradshaw         [robertwb@google.com]
>=20
> =3D=3D Additional Interested Contributors =3D=3D
>=20
> * Debo Dutta              [dedutta@cisco.com]
> * Henry Saputra           [hsaputra@apache.org]
> * Taylor Goetz            [ptgoetz@gmail.com]
> * James Carman            [james@carmanconsulting.com]
> * Joe Witt                [joewitt@apache.org]
> * Vaibhav Gumashta        [vgumashta@hortonworks.com]
> * Prasanth Jayachandran   [pjayachandran@hortonworks.com]
> * Johan Edstrom           [seijoed@gmail.com]
> * Hugo Louro              [hmclouro@gmail.com]
> * Krzysztof Sobkowiak     [krzys.sobkowiak@gmail.com]
> * Jeff Genender           [jgenender@apache.org]
> * Edward J. Yoon          [edward.yoon@samsung.com]
> * Hao Chen                [hao@apache.org]
> * Byung-Gon Chun          [bgchun@gmail.com]
> * Charitha Elvitigala     [charithcc@apache.org]
> * Alexander Bezzubov      [bzz@apache.org]
> * Tsuyoshi Ozawa          [ozawa@apache.org]
> * Mayank Bansal           [mabansal@gmail.com]
> * Supun Kamburugamuve     [supun@apache.org]
> * Matthias Wessendorf     [matzew@apache.org]
> * Felix Cheung            [felixcheung@apache.org]
> * Ajay Yadava             [ajay.yadav@inmobi.com]
> * Liang Chen              [chenliang613@huawei.com]
> * Renaud Richardet        [renaud (at) apache (dot) org]
> * Bakey Pan               [bakey1985@gmail.com]
> * Andreas Neumann         [anew@apache.org]
> * Suresh Marru            [smarru@apache.org]
> * Hadrian Zbarcea         [hzbarcea@gmail.com]
>=20
> =3D=3D Affiliations =3D=3D
>=20
> The initial committers are from six organizations. Google developed =
Dataflow and the Dataflow SDK, data Artisans developed the Flink runner, =
and Cloudera (Labs) developed the Spark runner.
>=20
> * Cloudera
>  * Tom White
> * data Artisans
>  * Aljoscha Krettek
>  * Kostas Tzoumas
>  * Maximilian Michels
>  * Stephan Ewen
> * Google
>  * Ben Chambers
>  * Dan Halperin
>  * Davor Bonaci
>  * Frances Perry
>  * James Malone
>  * Kenneth Knowles
>  * Luke Cwik
>  * Tyler Akidau
>  * Robert Bradshaw
> * PayPal
>  * Amit Sela
> * Slack
>  * Josh Wills
> * Talend
>  * Jean-Baptiste Onofr=C3=A9
>=20
> =3D=3D Sponsors =3D=3D
>=20
> =3D=3D=3D Champion =3D=3D=3D
>=20
> * Jean-Baptiste Onofre         [jbonofre@apache.org]
>=20
> =3D=3D=3D Nominated Mentors =3D=3D=3D
>=20
> * Jean-Baptiste Onofre       [jbonofre@apache.org]
> * Jim Jagielski              [jim@apache.org]
> * Venkatesh Seetharam        [venkatesh@apache.org]
> * Bertrand Delacretaz        [bdelacretaz@apache.org]
> * Ted Dunning                [tdunning@apache.org]
>=20
> =3D=3D=3D Sponsoring Entity =3D=3D=3D
>=20
> The Apache Incubator
> ----
>=20
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>=20

