incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "DataflowProposal" by jbonofre
Date Wed, 20 Jan 2016 15:48:23 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "DataflowProposal" page has been changed by jbonofre:
https://wiki.apache.org/incubator/DataflowProposal?action=diff&rev1=1&rev2=2

  
  Dataflow started as a set of Google projects focused on making data processing easier, faster,
and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and MillWheel
inside Google and is focused on providing a unified solution for batch and stream processing.
The projects on which Dataflow is based are described in several publicly available papers:
  
- * MapReduce - http://research.google.com/archive/mapreduce.html
+  * MapReduce - http://research.google.com/archive/mapreduce.html
- * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
+  * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
- * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
+  * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
- * MillWheel - http://research.google.com/pubs/pub41378.html
+  * MillWheel - http://research.google.com/pubs/pub41378.html
  
  Dataflow was designed from the start to provide a portable programming layer. When you define
a data processing pipeline with the Dataflow model, you are creating a job which is capable
of being processed by any number of Dataflow processing engines. Several runners have been
developed to execute Dataflow pipelines on other open source runtimes, including runners
for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the
developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program
to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow
Java SDK is already available on GitHub, and is independent of the Google Cloud Dataflow service.
A Python SDK is currently in active development.
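
  The portability described above can be pictured with a small sketch. It is a toy model
written for this note, not the Dataflow SDK: the names PipelineSpec, EagerRunner, and
LazyRunner are hypothetical. It only shows the idea that a job is described once and can
then be handed to different execution engines.

```python
# Toy illustration only: every class here is hypothetical and is NOT Dataflow SDK API.
from typing import Callable, Iterable, List

Step = Callable[[Iterable], Iterable]


class PipelineSpec:
    """Engine-independent description of a job: an ordered list of steps."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def apply(self, step: Step) -> "PipelineSpec":
        self.steps.append(step)
        return self


class EagerRunner:
    """Runs each step to completion, materializing a list between steps."""

    def run(self, spec: PipelineSpec, source: Iterable) -> list:
        data = list(source)
        for step in spec.steps:
            data = list(step(data))
        return data


class LazyRunner:
    """Chains the steps as generators and pulls elements through on demand."""

    def run(self, spec: PipelineSpec, source: Iterable) -> list:
        stream = iter(source)
        for step in spec.steps:
            stream = step(stream)
        return list(stream)


# The job is defined once, without reference to any runner.
spec = (
    PipelineSpec()
    .apply(lambda xs: (x * 2 for x in xs))      # double every element
    .apply(lambda xs: (x for x in xs if x > 2))  # keep values greater than 2
)

# The same description is executed by two different engines.
eager_result = EagerRunner().run(spec, [1, 2, 3])
lazy_result = LazyRunner().run(spec, [1, 2, 3])
```

  Both runners return the same result for the same spec, which is the property a portable
programming layer is meant to provide.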
  
@@ -25, +25 @@

  
  The Dataflow programming model has been designed with simplicity, scalability, and speed
as key tenets. In the Dataflow model, you only need to think about four top-level concepts
when constructing your data processing job:
  
- * Pipelines - The data processing job made of a series of computations including input,
processing, and output
+  * Pipelines - The data processing job made of a series of computations including input,
processing, and output
- * PCollections - Bounded (or unbounded) datasets which represent the input, intermediate
and output data in pipelines
+  * PCollections - Bounded (or unbounded) datasets which represent the input, intermediate
and output data in pipelines
- * PTransforms - A data processing step in a pipeline in which one or more PCollections are
an input and output
+  * PTransforms - A data processing step in a pipeline in which one or more PCollections
are an input and output
- * I/O Sources and Sinks - APIs for reading and writing data which are the roots and endpoints
of the pipeline
+  * I/O Sources and Sinks - APIs for reading and writing data which are the roots and endpoints
of the pipeline
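
  As a rough illustration of how the four concepts fit together, here is a minimal word-count
sketch. It is an assumption-laden toy, not the Dataflow SDK API: the Pipeline and PCollection
classes, the read/write methods, and the helper functions are all hypothetical, and only
bounded in-memory data is modeled.

```python
# Toy illustration only: hypothetical names, NOT the Dataflow SDK API.
from collections import Counter


class PCollection:
    """A bounded dataset flowing through a pipeline (unbounded data is not modeled)."""

    def __init__(self, pipeline, elements):
        self.pipeline = pipeline
        self.elements = list(elements)

    def apply(self, transform):
        # A PTransform here is a plain function: elements in, elements out.
        return PCollection(self.pipeline, transform(self.elements))


class Pipeline:
    """The whole job: input, processing steps, and output."""

    def read(self, source):
        # A source is the root of the pipeline; here, a callable returning elements.
        return PCollection(self, source())

    def write(self, pcoll, sink):
        # A sink is an endpoint of the pipeline; here, a callable consuming elements.
        sink(pcoll.elements)


def split_words(lines):
    """PTransform: one line in, zero or more words out."""
    return [word for line in lines for word in line.split()]


def count_per_element(words):
    """PTransform: words in, sorted (word, count) pairs out."""
    return sorted(Counter(words).items())


output = []  # stands in for an external sink such as a file

pipeline = Pipeline()
lines = pipeline.read(lambda: ["a b", "b c b"])
counts = lines.apply(split_words).apply(count_per_element)
pipeline.write(counts, output.extend)
```

  Each apply call yields a new PCollection, so intermediate data (the words) and final data
(the counts) are represented the same way as the input.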
  
  == Rationale ==
  
@@ -44, +44 @@

  
  Our immediate goals include the following:
  
- * Plan for reconciling the Dataflow Java SDK and various runners into one project
+  * Plan for reconciling the Dataflow Java SDK and various runners into one project
- * Plan for refactoring the existing Java SDK for better extensibility by SDK and runner
writers
+  * Plan for refactoring the existing Java SDK for better extensibility by SDK and runner
writers
- * Validating all dependencies are ASL 2.0 or compatible
+  * Validating all dependencies are ASL 2.0 or compatible
- * Understanding and adapting to the Apache development process
+  * Understanding and adapting to the Apache development process
  
  Our short-term goals include:
  
- * Moving the newly-merged lists, and build utilities to Apache
+  * Moving the newly-merged lists, and build utilities to Apache
- * Start refactoring codebase and move code to Apache Git repo
+  * Start refactoring codebase and move code to Apache Git repo
- * Continue development of new features, functions, and fixes in the Dataflow Java SDK, and
Dataflow runners
+  * Continue development of new features, functions, and fixes in the Dataflow Java SDK,
and Dataflow runners
- * Cleaning up the Dataflow SDK sources and crafting a roadmap and plan for how to include
new major ideas, modules, and runtimes
+  * Cleaning up the Dataflow SDK sources and crafting a roadmap and plan for how to include
new major ideas, modules, and runtimes
- * Establishment of easy and clear build/test framework for Dataflow and associated runtimes;
creation of testing, rollback, and validation policy
+  * Establishment of easy and clear build/test framework for Dataflow and associated runtimes;
creation of testing, rollback, and validation policy
- * Analysis and design for work needed to make Dataflow a better data processing abstraction
layer for multiple open source frameworks and environments
+  * Analysis and design for work needed to make Dataflow a better data processing abstraction
layer for multiple open source frameworks and environments
  
  Finally, we have a number of intermediate-term goals:
  
- * Roadmapping, planning, and execution of integrations with other OSS and non-OSS projects/products
+  * Roadmapping, planning, and execution of integrations with other OSS and non-OSS projects/products
- * Inclusion of additional SDK for Python, which is under active development
+  * Inclusion of additional SDK for Python, which is under active development
  
  == Current Status ==
  
@@ -79, +79 @@

  
  The core developers for Dataflow and the Dataflow runners are:
  
- * Frances Perry
+  * Frances Perry
- * Tyler Akidau
+  * Tyler Akidau
- * Davor Bonaci
+  * Davor Bonaci
- * Luke Cwik
+  * Luke Cwik
- * Ben Chambers
+  * Ben Chambers
- * Kenn Knowles
+  * Kenn Knowles
- * Dan Halperin
+  * Dan Halperin
- * Daniel Mills
+  * Daniel Mills
- * Mark Shields
+  * Mark Shields
- * Craig Chambers
+  * Craig Chambers
- * Maximilian Michels
+  * Maximilian Michels
- * Tom White
+  * Tom White
- * Josh Wills
+  * Josh Wills
  
  === Alignment ===
  
@@ -119, +119 @@

  
  Dataflow directly interoperates with or utilizes several existing Apache projects. 
  
- * Build
+  * Build
- ** Apache Maven
+   * Apache Maven
- * Data I/O, Libraries
+  * Data I/O, Libraries
- ** Apache Avro
+   * Apache Avro
- ** Apache Commons
+   * Apache Commons
- * Dataflow runners
+  * Dataflow runners
- ** Apache Flink
+   * Apache Flink
- ** Apache Spark
+   * Apache Spark
  
  Dataflow, when used in batch mode, shares similarities with Apache Crunch; however, Dataflow
is focused on a model, SDK, and abstraction layer beyond Spark and Hadoop (MapReduce). One
key goal of Dataflow is to provide an intermediate abstraction layer which can easily be implemented
and utilized across several different processing frameworks.
  
@@ -138, +138 @@

  
  The following documentation is relevant to this proposal. Relevant portions of the documentation
will be contributed to the Apache Dataflow project.
  
- * Dataflow website: https://cloud.google.com/dataflow
+  * Dataflow website: https://cloud.google.com/dataflow
- * Dataflow programming model: https://cloud.google.com/dataflow/model/programming-model
+  * Dataflow programming model: https://cloud.google.com/dataflow/model/programming-model
- * Codebases
+  * Codebases
- ** Dataflow Java SDK: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
+   * Dataflow Java SDK: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
- ** Flink Dataflow runner: https://github.com/dataArtisans/flink-dataflow
+   * Flink Dataflow runner: https://github.com/dataArtisans/flink-dataflow
- ** Spark Dataflow runner: https://github.com/cloudera/spark-dataflow
+   * Spark Dataflow runner: https://github.com/cloudera/spark-dataflow
- * Dataflow Java SDK issue tracker: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues
+  * Dataflow Java SDK issue tracker: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues
- * google-cloud-dataflow tag on Stack Overflow: http://stackoverflow.com/questions/tagged/google-cloud-dataflow
+  * google-cloud-dataflow tag on Stack Overflow: http://stackoverflow.com/questions/tagged/google-cloud-dataflow
  
  == Initial Source ==
  
  The initial source for Dataflow which we will submit to the Apache Software Foundation will include
several related projects which are currently hosted in the following GitHub repositories:
  
- * Dataflow Java SDK (https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
+  * Dataflow Java SDK (https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
- * Flink Dataflow runner (https://github.com/dataArtisans/flink-dataflow) 
+  * Flink Dataflow runner (https://github.com/dataArtisans/flink-dataflow) 
- * Spark Dataflow runner (https://github.com/cloudera/spark-dataflow)
+  * Spark Dataflow runner (https://github.com/cloudera/spark-dataflow)
  
  These projects have always been Apache 2.0 licensed. We intend to bundle all of these repositories
since they are all complementary and should be maintained in one project. Prior to our submission,
we will combine all of these projects into a new git repository.
  
@@ -161, +161 @@

  
  The source for the Dataflow SDK and the three runners (Spark, Flink, Google Cloud Dataflow)
is already licensed under the Apache License 2.0.
  
- * Dataflow SDK - https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENSE
+  * Dataflow SDK - https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENSE
- * Flink runner - https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE
+  * Flink runner - https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE
- * Spark runner - https://github.com/cloudera/spark-dataflow/blob/master/LICENSE
+  * Spark runner - https://github.com/cloudera/spark-dataflow/blob/master/LICENSE
  
  Contributors to the Dataflow SDK have also signed the Google Individual Contributor License
Agreement (https://cla.developers.google.com/about/google-individual) in order to contribute
to the project.
  
@@ -179, +179 @@

  
  We currently use a mix of mailing lists. We will migrate our existing mailing lists to the
following:
  
- * dev@dataflow.incubator.apache.org
+  * dev@dataflow.incubator.apache.org
- * user@dataflow.incubator.apache.org
+  * user@dataflow.incubator.apache.org
- * private@dataflow.incubator.apache.org
+  * private@dataflow.incubator.apache.org
- * commits@dataflow.incubator.apache.org
+  * commits@dataflow.incubator.apache.org
  
  === Source Control ===
  
@@ -194, +194 @@

  
  == Initial Committers ==
  
- * Aljoscha Krettek        [aljoscha@apache.org]
+  * Aljoscha Krettek        [aljoscha@apache.org]
- * Amit Sela               [amitsela33@gmail.com]
+  * Amit Sela               [amitsela33@gmail.com]
- * Ben Chambers            [bchambers@google.com]
+  * Ben Chambers            [bchambers@google.com]
- * Craig Chambers          [chambers@google.com]
+  * Craig Chambers          [chambers@google.com]
- * Dan Halperin            [dhalperi@google.com]
+  * Dan Halperin            [dhalperi@google.com]
- * Davor Bonaci            [davor@google.com]
+  * Davor Bonaci            [davor@google.com]
- * Frances Perry           [fjp@google.com]
+  * Frances Perry           [fjp@google.com]
- * James Malone            [jamesmalone@google.com]
+  * James Malone            [jamesmalone@google.com]
- * Jean-Baptiste Onofré    [jbonofre@apache.org]
+  * Jean-Baptiste Onofré    [jbonofre@apache.org]
- * Josh Wills              [jwills@apache.org]
+  * Josh Wills              [jwills@apache.org]
- * Kostas Tzoumas          [kostas@data-artisans.com]
+  * Kostas Tzoumas          [kostas@data-artisans.com]
- * Kenneth Knowles         [klk@google.com]
+  * Kenneth Knowles         [klk@google.com]
- * Luke Cwik               [lcwik@google.com]
+  * Luke Cwik               [lcwik@google.com]
- * Maximilian Michels      [mxm@apache.org]
+  * Maximilian Michels      [mxm@apache.org]
- * Stephan Ewen            [stephan@data-artisans.com]
+  * Stephan Ewen            [stephan@data-artisans.com]
- * Tom White               [tom@cloudera.com]
+  * Tom White               [tom@cloudera.com]
- * Tyler Akidau            [takidau@google.com]
+  * Tyler Akidau            [takidau@google.com]
  
  == Affiliations ==
  
  The initial committers are from six organizations. Google developed Dataflow and the Dataflow
SDK, data Artisans developed the Flink runner, and Cloudera (Labs) developed the Spark runner.
  
- * Cloudera
+  * Cloudera
- ** Tom White
+   * Tom White
- * Data Artisans
+  * Data Artisans
- ** Aljoscha Krettek
+   * Aljoscha Krettek
- ** Kostas Tzoumas
+   * Kostas Tzoumas
- ** Maximilian Michels
+   * Maximilian Michels
- ** Stephan Ewen
+   * Stephan Ewen
- * Google
+  * Google
- ** Ben Chambers
+   * Ben Chambers
- ** Dan Halperin
+   * Dan Halperin
- ** Davor Bonaci
+   * Davor Bonaci
- ** Frances Perry
+   * Frances Perry
- ** James Malone
+   * James Malone
- ** Kenneth Knowles
+   * Kenneth Knowles
- ** Luke Cwik
+   * Luke Cwik
- ** Tyler Akidau
+   * Tyler Akidau
- * PayPal
+  * PayPal
- ** Amit Sela
+   * Amit Sela
- * Slack
+  * Slack
- ** Josh Wills
+   * Josh Wills
- * Talend
+  * Talend
- ** Jean-Baptiste Onofré
+   * Jean-Baptiste Onofré
  
  == Sponsors ==
  
  === Champion ===
  
- * Jean-Baptiste Onofre         [jbonofre@apache.org]
+  * Jean-Baptiste Onofre         [jbonofre@apache.org]
  
  === Nominated Mentors ===
  
- * Jim Jagielski              [jim@apache.org]
+  * Jim Jagielski              [jim@apache.org]
- * Venkatesh Seetharam        [venkatesh@apache.org]
+  * Venkatesh Seetharam        [venkatesh@apache.org]
- * Bertrand Delacretaz        [bdelacretaz@apache.org]
+  * Bertrand Delacretaz        [bdelacretaz@apache.org]
- * Ted Dunning                [tdunning@apache.org]
+  * Ted Dunning                [tdunning@apache.org]
  
  === Sponsoring Entity ===
  

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

