incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "DataflowProposal" by jbonofre
Date Thu, 28 Jan 2016 06:26:39 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "DataflowProposal" page has been changed by jbonofre:
https://wiki.apache.org/incubator/DataflowProposal?action=diff&rev1=26&rev2=27

- = Apache Dataflow =
+ = Apache Beam =
  
  == Abstract ==
  
- Dataflow is an open source, unified model and set of language-specific SDKs for defining
and executing data processing workflows, and also data ingestion and integration flows, supporting
Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
simplify the mechanics of large-scale batch and streaming data processing and can run on a
number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service).
Dataflow also brings DSL in different languages, allowing users to easily implement their
data integration processes.
+ Apache Beam is an open source, unified model and set of language-specific SDKs for defining
and executing data processing workflows, and also data ingestion and integration flows, supporting
Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Beam pipelines
simplify the mechanics of large-scale batch and streaming data processing and can run on a
number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service).
Beam also brings DSLs in different languages, allowing users to easily implement their data
integration processes.
  
  == Proposal ==
  
- Dataflow is a simple, flexible, and powerful system for distributed data processing at any
scale. Dataflow provides a unified programming model, a software development kit to define
and construct data processing pipelines, and runners to execute Dataflow pipelines in several
runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be
used for a variety of streaming or batch data processing goals including ETL, stream analysis,
and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like
parallelism, combined with support for powerful data windowing, and fine-grained correctness
control. 
+ Beam is a simple, flexible, and powerful system for distributed data processing at any scale.
Beam provides a unified programming model, a software development kit to define and construct
data processing pipelines, and runners to execute Beam pipelines in several runtime engines,
like Apache Spark, Apache Flink, or Google Cloud Dataflow. Beam can be used for a variety
of streaming or batch data processing goals including ETL, stream analysis, and aggregate
computation. The underlying programming model for Beam provides MapReduce-like parallelism,
combined with support for powerful data windowing, and fine-grained correctness control. 
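As a rough illustration of the unified model described above, here is a conceptual sketch in plain Python. This is not the Dataflow/Beam SDK API; every class and function name below is hypothetical and only illustrates the idea of a pipeline as composable transforms over collections.

```python
# Conceptual sketch of the Beam-style model: a pipeline is a graph of
# transforms applied to (bounded or unbounded) collections of elements.
# None of these names come from the real SDK; they only illustrate the idea.

class PCollection:
    """A (here: bounded) dataset flowing through the pipeline."""
    def __init__(self, elements):
        self.elements = list(elements)

    def apply(self, transform):
        # Each transform consumes one collection and produces another,
        # which is what lets the same job run on different engines.
        return PCollection(transform(self.elements))

# Two toy transforms: an element-wise map and a keyed aggregation.
def to_key_value(lines):
    return [(word, 1) for line in lines for word in line.split()]

def sum_per_key(pairs):
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return sorted(counts.items())

# A tiny ETL-style job: ingest -> transform -> aggregate.
output = (PCollection(["beam beam flink", "spark beam"])
          .apply(to_key_value)
          .apply(sum_per_key))
print(output.elements)  # [('beam', 3), ('flink', 1), ('spark', 1)]
```

The point of the sketch is only the shape of the model: the job is expressed as transforms over collections, independent of how (or where) those transforms are executed.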
  
  == Background ==
  
- Dataflow started as a set of Google projects focused on making data processing easier, faster,
and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel
inside Google and is focused on providing a unified solution for batch and stream processing.
These projects on which Dataflow is based have been published in several papers made available
to the public:
+ Beam started as a set of Google projects (Google Cloud Dataflow) focused on making data
processing easier, faster, and less costly. The Beam model is a successor to MapReduce, FlumeJava,
and Millwheel inside Google and is focused on providing a unified solution for batch and stream
processing. These projects on which Beam is based have been published in several papers made
available to the public:
  
   * MapReduce - http://research.google.com/archive/mapreduce.html
   * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
   * FlumeJava - http://research.google.com/pubs/pub35650.html
   * MillWheel - http://research.google.com/pubs/pub41378.html
  
- Dataflow was designed from the start to provide a portable programming layer. When you define
a data processing pipeline with the Dataflow model, you are creating a job which is capable
of being processed by any number of Dataflow processing engines. Several engines have been
developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner
for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the
developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program
to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow
Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service.
Another Python SDK is currently in active development.
+ Beam was designed from the start to provide a portable programming layer. When you define
a data processing pipeline with the Beam model, you are creating a job which is capable of
being processed by any number of Beam processing engines. Several engines have been developed
to run Beam pipelines in other open source runtimes, including a Beam runner for Apache Flink
and Apache Spark. There is also a “direct runner”, for execution on the developer machine
(mainly for dev/debug purposes). Another runner allows a Beam program to run on a managed
service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already
available on GitHub, and independent from the Google Cloud Dataflow service. Another Python
SDK is currently in active development.
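The portability idea in the paragraph above can be sketched minimally: the pipeline is just a description of work, and each runner decides how to execute that same description. Again a hypothetical Python sketch, not the real SDK; only the in-process "direct runner" analogy is shown.

```python
# Conceptual sketch (hypothetical names, not the real SDK): the pipeline is
# only a description -- an ordered list of transforms -- and a runner is
# whatever knows how to execute that description on some engine.

class Pipeline:
    def __init__(self):
        self.transforms = []

    def apply(self, transform):
        self.transforms.append(transform)
        return self

class DirectRunner:
    """In-process execution, analogous to the 'direct runner' used for
    dev/debug on the developer machine."""
    def run(self, pipeline, data):
        for transform in pipeline.transforms:
            data = transform(data)
        return data

# The same Pipeline object could be handed to a Flink-, Spark-, or
# service-backed runner without changing the pipeline definition.
p = Pipeline().apply(lambda xs: [x * 2 for x in xs]).apply(sum)
print(DirectRunner().run(p, [1, 2, 3]))  # 12
```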
  
- In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an
OSS project under the ASF. The runners which are a part of this proposal include those for
Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the
Google Cloud Dataflow service runner is not included in this proposal. Further references
to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal
(Apache Dataflow) only. The initial submission will contain the already-released Java SDK;
Google intends to submit the Python SDK later in the incubation process. The Google Cloud
Dataflow service will continue to be one of many runners for Dataflow, built on Google Cloud
Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will develop against the
Apache project additions, updates, and changes. Google Cloud Dataflow will become one user
of Apache Dataflow and will participate in the project openly and publicly. 
+ In this proposal, the Beam SDKs, model, and a set of runners will be submitted as an OSS
project under the ASF. The runners which are a part of this proposal include those for Spark
(from Cloudera), Flink (from data Artisans), and local development (from Google); the Google
Cloud Dataflow service runner is not included in this proposal. Further references to Beam
will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache
Beam) only. The initial submission will contain the already-released Java SDK; Google intends
to submit the Python SDK later in the incubation process. The Google Cloud Dataflow service
will continue to be one of many runners for Beam, built on Google Cloud Platform, to run Beam
pipelines. Necessarily, Cloud Dataflow will develop against the Apache project additions,
updates, and changes. Google Cloud Dataflow will become one user of Apache Beam and will participate
in the project openly and publicly. 
  
- The Dataflow programming model has been designed with simplicity, scalability, and speed
as key tenants. In the Dataflow model, you only need to think about four top-level concepts
when constructing your data processing job:
+ The Beam programming model has been designed with simplicity, scalability, and speed as
key tenets. In the Beam model, you only need to think about four top-level concepts when
constructing your data processing job:
  
   * Pipelines - The data processing job made of a series of computations including input,
processing, and output
   * PCollections - Bounded (or unbounded) datasets which represent the input, intermediate
and output data in pipelines
@@ -32, +32 @@

  
  == Rationale ==
  
- With Dataflow, Google intended to develop a framework which allowed developers to be maximally
productive in defining the processing, and then be able to execute the program at various
levels of latency/cost/completeness without re-architecting or re-writing it. This goal was
informed by Google’s past experience  developing several models, frameworks, and tools useful
for large-scale and distributed data processing. While Google has previously published papers
describing some of its technologies, Google decided to take a different approach with Dataflow.
Google open-sourced the SDK and model alongside commercialization of the idea and ahead of
publishing papers on the topic. As a result, a number of open source runtimes exist for Dataflow,
such as the Apache Flink and Apache Spark runners.
+ With Google Dataflow, Google intended to develop a framework which allowed developers to
be maximally productive in defining the processing, and then be able to execute the program
at various levels of latency/cost/completeness without re-architecting or re-writing it. This
goal was informed by Google’s past experience developing several models, frameworks, and
tools useful for large-scale and distributed data processing. While Google has previously
published papers describing some of its technologies, Google decided to take a different approach
with Dataflow. Google open-sourced the SDK and model alongside commercialization of the idea
and ahead of publishing papers on the topic. As a result, a number of open source runtimes
exist for Dataflow, such as the Apache Flink and Apache Spark runners.
  
- We believe that submitting Dataflow as an Apache project will provide an immediate, worthwhile,
and substantial contribution to the open source community. As an incubating project, we believe
Dataflow will have a better opportunity to provide a meaningful contribution to OSS and also
integrate with other Apache projects.
+ We believe that submitting Beam as an Apache project will provide an immediate, worthwhile,
and substantial contribution to the open source community. As an incubating project, we believe
Beam will have a better opportunity to provide a meaningful contribution to OSS and also
integrate with other Apache projects.
  
- In the long term, we believe Dataflow can be a powerful abstraction layer for data processing.
By providing an abstraction layer for data pipelines and processing, data workflows can be
increasingly portable, resilient to breaking changes in tooling, and compatible across many
execution engines, runtimes, and open source projects. 
+ In the long term, we believe Beam can be a powerful abstraction layer for data processing.
By providing an abstraction layer for data pipelines and processing, data workflows can be
increasingly portable, resilient to breaking changes in tooling, and compatible across many
execution engines, runtimes, and open source projects. 
  
  == Initial Goals ==
  
@@ -56, +56 @@

   * Continue development of new features, functions, and fixes in the Dataflow Java SDK,
and Dataflow runners
   * Cleaning up the Dataflow SDK sources and crafting a roadmap and plan for how to include
new major ideas, modules, and runtimes
   * Establishment of easy and clear build/test framework for Dataflow and associated runtimes;
creation of testing, rollback, and validation policy
-  * Analysis and design for work needed to make Dataflow a better data processing abstraction
layer for multiple open source frameworks and environments
+  * Analysis and design for work needed to make Beam a better data processing abstraction
layer for multiple open source frameworks and environments
  
  Finally, we have a number of intermediate-term goals:
  
@@ -67, +67 @@

  
  === Meritocracy ===
  
- Dataflow was initially developed based on ideas from many employees within Google. As an
ASL OSS project on GitHub, the Dataflow SDK has received contributions from data Artisans,
Cloudera Labs, and other individual developers. As a project under incubation, we are committed
to expanding our effort to build an environment which supports a meritocracy. We are focused
on engaging the community and other related projects for support and contributions. Moreover,
we are committed to ensure contributors and committers to Dataflow come from a broad mix of
organizations through a merit-based decision process during incubation. We believe strongly
in the Dataflow model and are committed to growing an inclusive community of Dataflow contributors.
+ Dataflow was initially developed based on ideas from many employees within Google. As an
ASL OSS project on GitHub, the Dataflow SDK has received contributions from data Artisans,
Cloudera Labs, and other individual developers. As a project under incubation, we are committed
to expanding our effort to build an environment which supports a meritocracy. We are focused
on engaging the community and other related projects for support and contributions. Moreover,
we are committed to ensuring contributors and committers to Beam come from a broad mix of
organizations through a merit-based decision process during incubation. We believe strongly
in the Beam model and are committed to growing an inclusive community of Beam contributors.
  
  === Community ===
  
@@ -95, +95 @@

  
  === Alignment ===
  
- The Dataflow SDK can be used to create Dataflow pipelines which can be executed on Apache
Spark or Apache Flink. Dataflow is also related to other Apache projects, such as Apache Crunch.
We plan on expanding functionality for Dataflow runners, support for additional domain specific
languages, and increased portability so Dataflow is a powerful abstraction layer for data
processing.
+ The Beam SDK can be used to create Beam pipelines which can be executed on Apache Spark
or Apache Flink. Beam is also related to other Apache projects, such as Apache Crunch. We
plan on expanding functionality for Beam runners, support for additional domain specific languages,
and increased portability so Beam is a powerful abstraction layer for data processing.
  
  == Known Risks ==
  
@@ -128, +128 @@

    * Apache Flink
    * Apache Spark
  
- Dataflow when used in batch mode shares similarities with Apache Crunch; however, Dataflow
is focused on a model, SDK, and abstraction layer beyond Spark and Hadoop (MapReduce.) One
key goal of Dataflow is to provide an intermediate abstraction layer which can easily be implemented
and utilized across several different processing frameworks.
+ When used in batch mode, Beam shares similarities with Apache Crunch; however, Beam is focused
on a model, SDK, and abstraction layer beyond Spark and Hadoop (MapReduce). One key goal of
Beam is to provide an intermediate abstraction layer which can easily be implemented and utilized
across several different processing frameworks.
  
  === An excessive fascination with the Apache brand ===
  
- With this proposal we are not seeking attention or publicity. Rather, we firmly believe
in the Dataflow model, SDK, and the ability to make Dataflow a powerful yet simple framework
for data processing. While the Dataflow SDK and model have been open source, we believe putting
code on GitHub can only go so far. We see the Apache community, processes, and mission as
critical for ensuring the Dataflow SDK and model are truly community-driven, positively impactful,
and innovative open source software. While Google has taken a number of steps to advance its
various open source projects, we believe Dataflow is a great fit for the Apache Software Foundation
due to its focus on data processing and its relationships to existing ASF projects.
+ With this proposal we are not seeking attention or publicity. Rather, we firmly believe
in the Beam model, SDK, and the ability to make Beam a powerful yet simple framework for data
processing. While the Dataflow SDK and model have been open source, we believe putting code
on GitHub can only go so far. We see the Apache community, processes, and mission as critical
for ensuring the Beam SDK and model are truly community-driven, positively impactful, and
innovative open source software. While Google has taken a number of steps to advance its various
open source projects, we believe Beam is a great fit for the Apache Software Foundation due
to its focus on data processing and its relationships to existing ASF projects.
  
  == Documentation ==
  
- The following documentation is relevant to this proposal. Relevant portion of the documentation
will be contributed to the Apache Dataflow project.
+ The following documentation is relevant to this proposal. Relevant portions of the documentation
will be contributed to the Apache Beam project.
  
   * Dataflow website: https://cloud.google.com/dataflow
   * Dataflow programming model: https://cloud.google.com/dataflow/model/programming-model
@@ -149, +149 @@

  
  == Initial Source ==
  
- The initial source for Dataflow which we will submit to the Apache Foundation will include
several related projects which are currently hosted on the GitHub repositories: 
+ The initial source for Beam which we will submit to the Apache Software Foundation will include
several related projects which are currently hosted in the following GitHub repositories:
  
   * Dataflow Java SDK (https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
   * Flink Dataflow runner (https://github.com/dataArtisans/flink-dataflow) 
@@ -179, +179 @@

  
  We currently use a mix of mailing lists. We will migrate our existing mailing lists to the
following:
  
-  * dev@dataflow.incubator.apache.org
+  * dev@beam.incubator.apache.org
-  * user@dataflow.incubator.apache.org
+  * user@beam.incubator.apache.org
-  * private@dataflow.incubator.apache.org
+  * private@beam.incubator.apache.org
-  * commits@dataflow.incubator.apache.org
+  * commits@beam.incubator.apache.org
  
  === Source Control ===
  
- The Dataflow team currently uses Git and would like to continue to do so. We request a Git
repository for Dataflow with mirroring to GitHub enabled. 
+ The Dataflow team currently uses Git and would like to continue to do so. We request a Git
repository for Beam with mirroring to GitHub enabled. 
+ 
+  * https://git-wip-us.apache.org/repos/asf/incubator-beam.git
  
  === Issue Tracking ===
  
  We request the creation of an Apache-hosted JIRA. The Dataflow project is currently using
both a public GitHub issue tracker and internal Google issue tracking. We will migrate and
combine from these two sources to the Apache JIRA. 
+ 
+  * Jira ID: BEAM
  
  == Initial Committers ==
  


