incubator-general mailing list archives

From Jean-Baptiste Onofré <...@nanthrax.net>
Subject Re: [DISCUSS] Apache Dataflow Incubator Proposal
Date Thu, 21 Jan 2016 16:07:12 GMT
Hey Alex,

Awesome: I added you on the proposal.

Thanks,
Regards
JB

On 01/21/2016 05:03 PM, Alexander Bezzubov wrote:
> Hi,
>
> It's great to see Dataflow becoming part of the Apache ecosystem; thank you
> for bringing it in.
> I would be happy to get involved and help.
>
> --
> Alex
>
> On Thu, Jan 21, 2016 at 8:42 PM, Jean-Baptiste Onofré <jb@nanthrax.net>
> wrote:
>
>> Perfect: done, you are on the proposal.
>>
>> Thanks !
>> Regards
>> JB
>>
>>
>> On 01/21/2016 11:55 AM, chatz wrote:
>>
>>> Charitha Elvitigala
>>>
>>> On 21 January 2016 at 16:17, Jean-Baptiste Onofré <jb@nanthrax.net>
>>> wrote:
>>>
>>>> Hi Chatz,
>>>>
>>>> sure, what name should I use on the proposal, Charitha ?
>>>>
>>>> Regards
>>>> JB
>>>>
>>>>
>>>> On 01/21/2016 11:32 AM, chatz wrote:
>>>>
>>>>> Hi Jean,
>>>>>
>>>>> I’d be interested in contributing as well.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Chatz
>>>>>
>>>>>
>>>>> On 21 January 2016 at 14:22, Jean-Baptiste Onofré <jb@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>>> Sweet: you are on the proposal ;)
>>>>>
>>>>>>
>>>>>> Thanks !
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>>
>>>>>> On 01/21/2016 08:55 AM, Byung-Gon Chun wrote:
>>>>>>
>>>>>>> This looks very interesting. I'm interested in contributing.
>>>>>>
>>>>>>>
>>>>>>> Thanks.
>>>>>>> -Gon
>>>>>>>
>>>>>>> ---
>>>>>>> Byung-Gon Chun
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 21, 2016 at 1:32 AM, James Malone <
>>>>>>> jamesmalone@google.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hello everyone,
>>>>>>>
>>>>>>>
>>>>>>>> Attached to this message is a proposed new project - Apache Dataflow, a
>>>>>>>> unified programming model for data processing and integration.
>>>>>>>>
>>>>>>>> The text of the proposal is included below. Additionally, the proposal is
>>>>>>>> in draft form on the wiki where we will make any required changes:
>>>>>>>>
>>>>>>>> https://wiki.apache.org/incubator/DataflowProposal
>>>>>>>>
>>>>>>>> We look forward to your feedback and input.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> James
>>>>>>>>
>>>>>>>> ----
>>>>>>>>
>>>>>>>> = Apache Dataflow =
>>>>>>>>
>>>>>>>> == Abstract ==
>>>>>>>>
>>>>>>>> Dataflow is an open source, unified model and set of language-specific
>>>>>>>> SDKs for defining and executing data processing workflows, as well as
>>>>>>>> data ingestion and integration flows, supporting Enterprise Integration
>>>>>>>> Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
>>>>>>>> simplify the mechanics of large-scale batch and streaming data
>>>>>>>> processing and can run on a number of runtimes like Apache Flink, Apache
>>>>>>>> Spark, and Google Cloud Dataflow (a cloud service). Dataflow also
>>>>>>>> provides DSLs in different languages, allowing users to easily implement
>>>>>>>> their data integration processes.
>>>>>>>>
>>>>>>>> == Proposal ==
>>>>>>>>
>>>>>>>> Dataflow is a simple, flexible, and powerful system for distributed data
>>>>>>>> processing at any scale. Dataflow provides a unified programming model,
>>>>>>>> a software development kit to define and construct data processing
>>>>>>>> pipelines, and runners to execute Dataflow pipelines on several runtime
>>>>>>>> engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow.
>>>>>>>> Dataflow can be used for a variety of streaming or batch data processing
>>>>>>>> goals including ETL, stream analysis, and aggregate computation. The
>>>>>>>> underlying programming model for Dataflow provides MapReduce-like
>>>>>>>> parallelism, combined with support for powerful data windowing and
>>>>>>>> fine-grained correctness control.
>>>>>>>>
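>>>>>>>> As a rough illustration of the windowing support, the sketch below
>>>>>>>> (assuming the Dataflow Java SDK 1.x API; exact class and package names
>>>>>>>> may differ in the donated code) assigns a PCollection to fixed one-minute
>>>>>>>> event-time windows, so that downstream aggregations are computed per
>>>>>>>> window:
>>>>>>>>
>>>>>>>>   // Minimal windowing sketch; package names assume the Dataflow Java SDK 1.x.
>>>>>>>>   import org.joda.time.Duration;
>>>>>>>>   import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
>>>>>>>>   import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
>>>>>>>>   import com.google.cloud.dataflow.sdk.values.PCollection;
>>>>>>>>
>>>>>>>>   public class WindowingSketch {
>>>>>>>>     // 'events' stands in for a (possibly unbounded) PCollection read from a source.
>>>>>>>>     static PCollection<String> windowByMinute(PCollection<String> events) {
>>>>>>>>       // Assign each element to a fixed one-minute window by its timestamp;
>>>>>>>>       // downstream aggregations (e.g. Count.perElement()) then run per window.
>>>>>>>>       return events.apply(
>>>>>>>>           Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>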
>>>>>>>> == Background ==
>>>>>>>>
>>>>>>>> Dataflow started as a set of Google projects focused on making data
>>>>>>>> processing easier, faster, and less costly. The Dataflow model is a
>>>>>>>> successor to MapReduce, FlumeJava, and MillWheel inside Google and is
>>>>>>>> focused on providing a unified solution for batch and stream processing.
>>>>>>>> The projects on which Dataflow is based have been described in several
>>>>>>>> papers made available to the public:
>>>>>>>>
>>>>>>>> * MapReduce - http://research.google.com/archive/mapreduce.html
>>>>>>>>
>>>>>>>> * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>>>>>>>>
>>>>>>>> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>>>>>>>>
>>>>>>>> * MillWheel - http://research.google.com/pubs/pub41378.html
>>>>>>>>
>>>>>>>> Dataflow was designed from the start to provide a portable programming
>>>>>>>> layer. When you define a data processing pipeline with the Dataflow
>>>>>>>> model, you are creating a job which is capable of being processed by any
>>>>>>>> number of Dataflow processing engines. Several engines have been
>>>>>>>> developed to run Dataflow pipelines in other open source runtimes,
>>>>>>>> including a Dataflow runner for Apache Flink and one for Apache Spark.
>>>>>>>> There is also a “direct runner” for execution on the developer machine
>>>>>>>> (mainly for dev/debug purposes). Another runner allows a Dataflow program
>>>>>>>> to run on a managed service, Google Cloud Dataflow, on Google Cloud
>>>>>>>> Platform. The Dataflow Java SDK is already available on GitHub and is
>>>>>>>> independent of the Google Cloud Dataflow service. A Python SDK is
>>>>>>>> currently in active development.
>>>>>>>>
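>>>>>>>> To make the portability point concrete, here is a sketch of how a runner
>>>>>>>> is chosen through PipelineOptions (assuming the Dataflow Java SDK 1.x
>>>>>>>> API; exact class and package names may differ in the donated code):
>>>>>>>>
>>>>>>>>   import com.google.cloud.dataflow.sdk.Pipeline;
>>>>>>>>   import com.google.cloud.dataflow.sdk.options.PipelineOptions;
>>>>>>>>   import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
>>>>>>>>   import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;
>>>>>>>>
>>>>>>>>   public class RunnerSelectionSketch {
>>>>>>>>     public static void main(String[] args) {
>>>>>>>>       PipelineOptions options = PipelineOptionsFactory.create();
>>>>>>>>
>>>>>>>>       // Local dev/debug: the "direct runner" executes on the developer machine.
>>>>>>>>       options.setRunner(DirectPipelineRunner.class);
>>>>>>>>       // The same, unchanged pipeline could instead target another engine
>>>>>>>>       // (a Flink or Spark runner class, or the Google Cloud Dataflow
>>>>>>>>       // service runner) simply by setting a different runner here.
>>>>>>>>
>>>>>>>>       Pipeline p = Pipeline.create(options);
>>>>>>>>       // ... apply the same transforms regardless of the chosen runner ...
>>>>>>>>       p.run();
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>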
>>>>>>>> In this proposal, the Dataflow SDKs, model, and a set of runners will be
>>>>>>>> submitted as an OSS project under the ASF. The runners which are a part
>>>>>>>> of this proposal include those for Spark (from Cloudera), Flink (from
>>>>>>>> data Artisans), and local development (from Google); the Google Cloud
>>>>>>>> Dataflow service runner is not included in this proposal. Further
>>>>>>>> references to Dataflow will refer only to the Dataflow model, SDKs, and
>>>>>>>> runners which are a part of this proposal (Apache Dataflow). The initial
>>>>>>>> submission will contain the already-released Java SDK; Google intends to
>>>>>>>> submit the Python SDK later in the incubation process. The Google Cloud
>>>>>>>> Dataflow service will continue to be one of many runners for Dataflow,
>>>>>>>> built on Google Cloud Platform, to run Dataflow pipelines. Necessarily,
>>>>>>>> Cloud Dataflow will develop against the Apache project's additions,
>>>>>>>> updates, and changes. Google Cloud Dataflow will become one user of
>>>>>>>> Apache Dataflow and will participate in the project openly and publicly.
>>>>>>>>
>>>>>>>> The Dataflow programming model has been designed with simplicity,
>>>>>>>> scalability, and speed as key tenets. In the Dataflow model, you only
>>>>>>>> need to think about four top-level concepts when constructing your data
>>>>>>>> processing job (a short sketch follows the list):
>>>>>>>>
>>>>>>>> * Pipelines - The data processing job, made of a series of computations
>>>>>>>> including input, processing, and output
>>>>>>>>
>>>>>>>> * PCollections - Bounded (or unbounded) datasets which represent the
>>>>>>>> input, intermediate, and output data in pipelines
>>>>>>>>
>>>>>>>> * PTransforms - A data processing step in a pipeline, which takes one or
>>>>>>>> more PCollections as input and produces one or more PCollections as
>>>>>>>> output
>>>>>>>>
>>>>>>>> * I/O Sources and Sinks - APIs for reading and writing data, which form
>>>>>>>> the roots and endpoints of the pipeline
>>>>>>>>
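>>>>>>>> A minimal sketch of a pipeline touching all four concepts (assuming the
>>>>>>>> Dataflow Java SDK 1.x API; the input and output paths are placeholders
>>>>>>>> and exact class names may differ in the donated code):
>>>>>>>>
>>>>>>>>   import com.google.cloud.dataflow.sdk.Pipeline;
>>>>>>>>   import com.google.cloud.dataflow.sdk.io.TextIO;
>>>>>>>>   import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
>>>>>>>>   import com.google.cloud.dataflow.sdk.transforms.Count;
>>>>>>>>   import com.google.cloud.dataflow.sdk.transforms.DoFn;
>>>>>>>>   import com.google.cloud.dataflow.sdk.transforms.ParDo;
>>>>>>>>   import com.google.cloud.dataflow.sdk.values.KV;
>>>>>>>>   import com.google.cloud.dataflow.sdk.values.PCollection;
>>>>>>>>
>>>>>>>>   public class WordCountSketch {
>>>>>>>>     public static void main(String[] args) {
>>>>>>>>       // Pipeline: the overall data processing job.
>>>>>>>>       Pipeline p =
>>>>>>>>           Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
>>>>>>>>
>>>>>>>>       // I/O source -> PCollection: a bounded dataset of input lines.
>>>>>>>>       PCollection<String> lines =
>>>>>>>>           p.apply(TextIO.Read.from("gs://some-bucket/input.txt"));
>>>>>>>>
>>>>>>>>       // PTransforms: count occurrences of each line, then format the results.
>>>>>>>>       PCollection<KV<String, Long>> counts =
>>>>>>>>           lines.apply(Count.<String>perElement());
>>>>>>>>       PCollection<String> formatted = counts.apply(
>>>>>>>>           ParDo.of(new DoFn<KV<String, Long>, String>() {
>>>>>>>>             @Override
>>>>>>>>             public void processElement(ProcessContext c) {
>>>>>>>>               c.output(c.element().getKey() + ": " + c.element().getValue());
>>>>>>>>             }
>>>>>>>>           }));
>>>>>>>>
>>>>>>>>       // I/O sink: write the results; the configured runner executes the job.
>>>>>>>>       formatted.apply(TextIO.Write.to("gs://some-bucket/output"));
>>>>>>>>       p.run();
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>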
>>>>>>>> == Rationale ==
>>>>>>>>
>>>>>>>> With Dataflow, Google intended to develop a framework which allowed
>>>>>>>> developers to be maximally productive in defining the processing, and
>>>>>>>> then be able to execute the program at various levels of
>>>>>>>> latency/cost/completeness without re-architecting or re-writing it. This
>>>>>>>> goal was informed by Google’s past experience developing several models,
>>>>>>>> frameworks, and tools useful for large-scale and distributed data
>>>>>>>> processing. While Google has previously published papers describing some
>>>>>>>> of its technologies, Google decided to take a different approach with
>>>>>>>> Dataflow. Google open-sourced the SDK and model alongside
>>>>>>>> commercialization of the idea and ahead of publishing papers on the
>>>>>>>> topic. As a result, a number of open source runtimes exist for Dataflow,
>>>>>>>> such as the Apache Flink and Apache Spark runners.
>>>>>>>>
>>>>>>>> We believe that submitting Dataflow as an Apache project will provide an
>>>>>>>> immediate, worthwhile, and substantial contribution to the open source
>>>>>>>> community. As an incubating project, we believe Dataflow will have a
>>>>>>>> better opportunity to provide a meaningful contribution to OSS and also
>>>>>>>> integrate with other Apache projects.
>>>>>>>>
>>>>>>>> In the long term, we believe Dataflow can be a powerful abstraction
>>>>>>>> layer for data processing. By providing an abstraction layer for data
>>>>>>>> pipelines and processing, data workflows can be increasingly portable,
>>>>>>>> resilient to breaking changes in tooling, and compatible across many
>>>>>>>> execution engines, runtimes, and open source projects.
>>>>>>>>
>>>>>>>> == Initial Goals ==
>>>>>>>>
>>>>>>>> We are breaking our initial goals into immediate (< 2 months),
>>>>>>>> short-term (2-4 months), and intermediate-term (> 4 months).
>>>>>>>>
>>>>>>>> Our immediate goals include the following:
>>>>>>>>
>>>>>>>> * Plan for reconciling the Dataflow Java SDK and various runners into
>>>>>>>> one project
>>>>>>>>
>>>>>>>> * Plan for refactoring the existing Java SDK for better extensibility by
>>>>>>>> SDK and runner writers
>>>>>>>>
>>>>>>>> * Validating all dependencies are ASL 2.0 or compatible
>>>>>>>>
>>>>>>>> * Understanding and adapting to the Apache development process
>>>>>>>>
>>>>>>>> Our short-term goals include:
>>>>>>>>
>>>>>>>> * Moving the newly-merged lists and build utilities to Apache
>>>>>>>>
>>>>>>>> * Start refactoring the codebase and moving code to the Apache Git repo
>>>>>>>>
>>>>>>>> * Continue development of new features, functions, and fixes in the
>>>>>>>> Dataflow Java SDK and Dataflow runners
>>>>>>>>
>>>>>>>> * Cleaning up the Dataflow SDK sources and crafting a roadmap and plan
>>>>>>>> for how to include new major ideas, modules, and runtimes
>>>>>>>>
>>>>>>>> * Establishment of an easy and clear build/test framework for Dataflow
>>>>>>>> and associated runtimes; creation of testing, rollback, and validation
>>>>>>>> policy
>>>>>>>>
>>>>>>>> * Analysis and design for work needed to make Dataflow a better data
>>>>>>>> processing abstraction layer for multiple open source frameworks and
>>>>>>>> environments
>>>>>>>>
>>>>>>>> Finally, we have a number of intermediate-term goals:
>>>>>>>>
>>>>>>>> * Roadmapping, planning, and execution of integrations with other OSS
>>>>>>>> and non-OSS projects/products
>>>>>>>>
>>>>>>>> * Inclusion of an additional SDK for Python, which is under active
>>>>>>>> development
>>>>>>>>
>>>>>>>> == Current Status ==
>>>>>>>>
>>>>>>>> === Meritocracy ===
>>>>>>>>
>>>>>>>> Dataflow was initially developed based on ideas from many employees
>>>>>>>> within Google. As an ASL OSS project on GitHub, the Dataflow SDK has
>>>>>>>> received contributions from data Artisans, Cloudera Labs, and other
>>>>>>>> individual developers. As a project under incubation, we are committed
>>>>>>>> to expanding our effort to build an environment which supports a
>>>>>>>> meritocracy. We are focused on engaging the community and other related
>>>>>>>> projects for support and contributions. Moreover, we are committed to
>>>>>>>> ensuring contributors and committers to Dataflow come from a broad mix
>>>>>>>> of organizations through a merit-based decision process during
>>>>>>>> incubation. We believe strongly in the Dataflow model and are committed
>>>>>>>> to growing an inclusive community of Dataflow contributors.
>>>>>>>>
>>>>>>>> === Community ===
>>>>>>>>
>>>>>>>> The core of the Dataflow Java SDK has been developed by Google for use
>>>>>>>> with Google Cloud Dataflow. Google has active community engagement in
>>>>>>>> the SDK GitHub repository
>>>>>>>> (https://github.com/GoogleCloudPlatform/DataflowJavaSDK) and on Stack
>>>>>>>> Overflow
>>>>>>>> (http://stackoverflow.com/questions/tagged/google-cloud-dataflow), and
>>>>>>>> has had contributions from a number of organizations and individuals.
>>>>>>>>
>>>>>>>> Every day, Cloud Dataflow is actively used by a number of organizations
>>>>>>>> and institutions for batch and stream processing of data. We believe
>>>>>>>> acceptance will allow us to consolidate existing Dataflow-related work,
>>>>>>>> grow the Dataflow community, and deepen connections between Dataflow and
>>>>>>>> other open source projects.
>>>>>>>>
>>>>>>>> === Core Developers ===
>>>>>>>>
>>>>>>>> The core developers for Dataflow and the Dataflow runners are:
>>>>>>>>
>>>>>>>> * Frances Perry
>>>>>>>>
>>>>>>>> * Tyler Akidau
>>>>>>>>
>>>>>>>> * Davor Bonaci
>>>>>>>>
>>>>>>>> * Luke Cwik
>>>>>>>>
>>>>>>>> * Ben Chambers
>>>>>>>>
>>>>>>>> * Kenn Knowles
>>>>>>>>
>>>>>>>> * Dan Halperin
>>>>>>>>
>>>>>>>> * Daniel Mills
>>>>>>>>
>>>>>>>> * Mark Shields
>>>>>>>>
>>>>>>>> * Craig Chambers
>>>>>>>>
>>>>>>>> * Maximilian Michels
>>>>>>>>
>>>>>>>> * Tom White
>>>>>>>>
>>>>>>>> * Josh Wills
>>>>>>>>
>>>>>>>> === Alignment ===
>>>>>>>>
>>>>>>>> The Dataflow SDK can be used to create Dataflow pipelines which can be
>>>>>>>> executed on Apache Spark or Apache Flink. Dataflow is also related to
>>>>>>>> other Apache projects, such as Apache Crunch. We plan on expanding
>>>>>>>> functionality for Dataflow runners, support for additional domain
>>>>>>>> specific languages, and increased portability so Dataflow is a powerful
>>>>>>>> abstraction layer for data processing.
>>>>>>>>
>>>>>>>> == Known Risks ==
>>>>>>>>
>>>>>>>> === Orphaned Products ===
>>>>>>>>
>>>>>>>> The Dataflow SDK is presently used by several organizations, from small
>>>>>>>> startups to Fortune 100 companies, to construct production pipelines
>>>>>>>> which are executed in Google Cloud Dataflow. Google has a long-term
>>>>>>>> commitment to advance the Dataflow SDK; moreover, Dataflow is seeing
>>>>>>>> increasing interest, development, and adoption from organizations
>>>>>>>> outside of Google.
>>>>>>>>
>>>>>>>> === Inexperience with Open Source ===
>>>>>>>>
>>>>>>>> Google believes strongly in open source and the exchange of information
>>>>>>>> to advance new ideas and work. Examples of this commitment are active
>>>>>>>> OSS projects such as Chromium (https://www.chromium.org) and Kubernetes
>>>>>>>> (http://kubernetes.io/). With Dataflow, we have tried to be increasingly
>>>>>>>> open and forward-looking; we have published a paper at the VLDB
>>>>>>>> conference describing the Dataflow model
>>>>>>>> (http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick to
>>>>>>>> release the Dataflow SDK as open source software with the launch of
>>>>>>>> Cloud Dataflow. Our submission to the Apache Software Foundation is a
>>>>>>>> logical extension of our commitment to open source software.
>>>>>>>>
>>>>>>>> === Homogeneous Developers ===
>>>>>>>>
>>>>>>>> The majority of committers in this proposal belong to Google because
>>>>>>>> Dataflow has emerged from several internal Google projects. This
>>>>>>>> proposal also includes committers outside of Google who are actively
>>>>>>>> involved with other Apache projects, such as Hadoop, Flink, and Spark.
>>>>>>>> We expect our entry into incubation will allow us to expand the number
>>>>>>>> of individuals and organizations participating in Dataflow development.
>>>>>>>> Additionally, separation of the Dataflow SDK from Google Cloud Dataflow
>>>>>>>> allows us to focus on the open source SDK and model and do what is best
>>>>>>>> for this project.
>>>>>>>>
>>>>>>>> === Reliance on Salaried Developers ===
>>>>>>>>
>>>>>>>> The Dataflow SDK and Dataflow runners have been developed primarily by
>>>>>>>> salaried developers supporting the Google Cloud Dataflow project. While
>>>>>>>> the Dataflow SDK and Cloud Dataflow have been developed by different
>>>>>>>> teams (and this proposal would reinforce that separation), we expect our
>>>>>>>> initial set of developers will still primarily be salaried. Contribution
>>>>>>>> has not been exclusively from salaried developers, however. For example,
>>>>>>>> the contrib directory of the Dataflow SDK
>>>>>>>> (https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/master/contrib)
>>>>>>>> contains items from free-time contributors. Moreover, separate projects,
>>>>>>>> such as ScalaFlow (https://github.com/darkjh/scalaflow), have been
>>>>>>>> created around the Dataflow model and SDK. We expect our reliance on
>>>>>>>> salaried developers will decrease over time during incubation.
>>>>>>>>
>>>>>>>> === Relationship with other Apache products ===
>>>>>>>>
>>>>>>>> Dataflow directly interoperates with or utilizes several existing
>>>>>>>> Apache projects.
>>>>>>>>
>>>>>>>> * Build
>>>>>>>>
>>>>>>>> ** Apache Maven
>>>>>>>>
>>>>>>>> * Data I/O, Libraries
>>>>>>>>
>>>>>>>> ** Apache Avro
>>>>>>>>
>>>>>>>> ** Apache Commons
>>>>>>>>
>>>>>>>> * Dataflow runners
>>>>>>>>
>>>>>>>> ** Apache Flink
>>>>>>>>
>>>>>>>> ** Apache Spark
>>>>>>>>
>>>>>>>> Dataflow, when used in batch mode, shares similarities with Apache
>>>>>>>> Crunch; however, Dataflow is focused on a model, SDK, and abstraction
>>>>>>>> layer beyond Spark and Hadoop (MapReduce). One key goal of Dataflow is
>>>>>>>> to provide an intermediate abstraction layer which can easily be
>>>>>>>> implemented and utilized across several different processing frameworks.
>>>>>>>>
>>>>>>>> === An excessive fascination with the Apache brand ===
>>>>>>>>
>>>>>>>> With this proposal we are not seeking attention or publicity. Rather,
>>>>>>>> we firmly believe in the Dataflow model, SDK, and the ability to make
>>>>>>>> Dataflow a powerful yet simple framework for data processing. While the
>>>>>>>> Dataflow SDK and model have been open source, we believe putting code on
>>>>>>>> GitHub can only go so far. We see the Apache community, processes, and
>>>>>>>> mission as critical for ensuring the Dataflow SDK and model are truly
>>>>>>>> community-driven, positively impactful, and innovative open source
>>>>>>>> software. While Google has taken a number of steps to advance its
>>>>>>>> various open source projects, we believe Dataflow is a great fit for the
>>>>>>>> Apache Software Foundation due to its focus on data processing and its
>>>>>>>> relationships to existing ASF projects.
>>>>>>>>
>>>>>>>> == Documentation ==
>>>>>>>>
>>>>>>>> The following documentation is relevant to this proposal. Relevant
>>>>>>>> portions of the documentation will be contributed to the Apache Dataflow
>>>>>>>> project.
>>>>>>>>
>>>>>>>> * Dataflow website: https://cloud.google.com/dataflow
>>>>>>>>
>>>>>>>> * Dataflow programming model:
>>>>>>>> https://cloud.google.com/dataflow/model/programming-model
>>>>>>>>
>>>>>>>> * Codebases
>>>>>>>>
>>>>>>>> ** Dataflow Java SDK:
>>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK
>>>>>>>>
>>>>>>>> ** Flink Dataflow runner:
>>>>>>>> https://github.com/dataArtisans/flink-dataflow
>>>>>>>>
>>>>>>>> ** Spark Dataflow runner: https://github.com/cloudera/spark-dataflow
>>>>>>>>
>>>>>>>> * Dataflow Java SDK issue tracker:
>>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues
>>>>>>>>
>>>>>>>> * google-cloud-dataflow tag on Stack Overflow:
>>>>>>>> http://stackoverflow.com/questions/tagged/google-cloud-dataflow
>>>>>>>>
>>>>>>>> == Initial Source ==
>>>>>>>>
>>>>>>>> The initial source for Dataflow which we will submit to the Apache
>>>>>>>> Software Foundation will include several related projects which are
>>>>>>>> currently hosted in the following GitHub repositories:
>>>>>>>>
>>>>>>>> * Dataflow Java SDK
>>>>>>>> (https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
>>>>>>>>
>>>>>>>> * Flink Dataflow runner (https://github.com/dataArtisans/flink-dataflow)
>>>>>>>>
>>>>>>>> * Spark Dataflow runner (https://github.com/cloudera/spark-dataflow)
>>>>>>>>
>>>>>>>> These projects have always been Apache 2.0 licensed. We intend to
>>>>>>>> bundle all of these repositories since they are all complementary and
>>>>>>>> should be maintained in one project. Prior to our submission, we will
>>>>>>>> combine all of these projects into a new git repository.
>>>>>>>>
>>>>>>>> == Source and Intellectual Property Submission Plan ==
>>>>>>>>
>>>>>>>> The source for the Dataflow SDK and the three runners (Spark, Flink,
>>>>>>>> and Google Cloud Dataflow) is already licensed under an Apache 2
>>>>>>>> license.
>>>>>>>>
>>>>>>>> * Dataflow SDK -
>>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENSE
>>>>>>>>
>>>>>>>> * Flink runner -
>>>>>>>> https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE
>>>>>>>>
>>>>>>>> * Spark runner -
>>>>>>>> https://github.com/cloudera/spark-dataflow/blob/master/LICENSE
>>>>>>>>
>>>>>>>> Contributors to the Dataflow SDK have also signed the Google Individual
>>>>>>>> Contributor License Agreement
>>>>>>>> (https://cla.developers.google.com/about/google-individual) in order to
>>>>>>>> contribute to the project.
>>>>>>>>
>>>>>>>> With respect to trademark rights, Google does not hold a trademark on
>>>>>>>> the phrase “Dataflow.” Based on feedback and guidance we receive during
>>>>>>>> the incubation process, we are open to renaming the project if necessary
>>>>>>>> for trademark or other concerns.
>>>>>>>>
>>>>>>>> == External Dependencies ==
>>>>>>>>
>>>>>>>> All external dependencies are licensed under an Apache 2.0 or
>>>>>>>> Apache-compatible license. As we grow the Dataflow community, we will
>>>>>>>> configure our build process to require and validate that all
>>>>>>>> contributions and dependencies are licensed under the Apache 2.0 license
>>>>>>>> or an Apache-compatible license.
>>>>>>>>
>>>>>>>> == Required Resources ==
>>>>>>>>
>>>>>>>> === Mailing Lists ===
>>>>>>>>
>>>>>>>> We currently use a mix of mailing lists. We will migrate our existing
>>>>>>>> mailing lists to the following:
>>>>>>>>
>>>>>>>> * dev@dataflow.incubator.apache.org
>>>>>>>>
>>>>>>>> * user@dataflow.incubator.apache.org
>>>>>>>>
>>>>>>>> * private@dataflow.incubator.apache.org
>>>>>>>>
>>>>>>>> * commits@dataflow.incubator.apache.org
>>>>>>>>
>>>>>>>> === Source Control ===
>>>>>>>>
>>>>>>>> The Dataflow team currently uses Git and would like to continue to do
>>>>>>>> so. We request a Git repository for Dataflow with mirroring to GitHub
>>>>>>>> enabled.
>>>>>>>>
>>>>>>>> === Issue Tracking ===
>>>>>>>>
>>>>>>>> We request the creation of an Apache-hosted JIRA. The Dataflow project
>>>>>>>> is currently using both a public GitHub issue tracker and internal
>>>>>>>> Google issue tracking. We will migrate and combine issues from these two
>>>>>>>> sources into the Apache JIRA.
>>>>>>>>
>>>>>>>> == Initial Committers ==
>>>>>>>>
>>>>>>>> * Aljoscha Krettek     [aljoscha@apache.org]
>>>>>>>>
>>>>>>>> * Amit Sela            [amitsela33@gmail.com]
>>>>>>>>
>>>>>>>> * Ben Chambers         [bchambers@google.com]
>>>>>>>>
>>>>>>>> * Craig Chambers       [chambers@google.com]
>>>>>>>>
>>>>>>>> * Dan Halperin         [dhalperi@google.com]
>>>>>>>>
>>>>>>>> * Davor Bonaci         [davor@google.com]
>>>>>>>>
>>>>>>>> * Frances Perry        [fjp@google.com]
>>>>>>>>
>>>>>>>> * James Malone         [jamesmalone@google.com]
>>>>>>>>
>>>>>>>> * Jean-Baptiste Onofré [jbonofre@apache.org]
>>>>>>>>
>>>>>>>> * Josh Wills           [jwills@apache.org]
>>>>>>>>
>>>>>>>> * Kostas Tzoumas       [kostas@data-artisans.com]
>>>>>>>>
>>>>>>>> * Kenneth Knowles      [klk@google.com]
>>>>>>>>
>>>>>>>> * Luke Cwik            [lcwik@google.com]
>>>>>>>>
>>>>>>>> * Maximilian Michels   [mxm@apache.org]
>>>>>>>>
>>>>>>>> * Stephan Ewen         [stephan@data-artisans.com]
>>>>>>>>
>>>>>>>> * Tom White            [tom@cloudera.com]
>>>>>>>>
>>>>>>>> * Tyler Akidau         [takidau@google.com]
>>>>>>>>
>>>>>>>> == Affiliations ==
>>>>>>>>
>>>>>>>> The initial committers are from six organizations. Google developed
>>>>>>>> Dataflow and the Dataflow SDK, data Artisans developed the Flink runner,
>>>>>>>> and Cloudera (Labs) developed the Spark runner.
>>>>>>>>
>>>>>>>> * Cloudera
>>>>>>>>
>>>>>>>> ** Tom White
>>>>>>>>
>>>>>>>> * Data Artisans
>>>>>>>>
>>>>>>>> ** Aljoscha Krettek
>>>>>>>>
>>>>>>>> ** Kostas Tzoumas
>>>>>>>>
>>>>>>>> ** Maximilian Michels
>>>>>>>>
>>>>>>>> ** Stephan Ewen
>>>>>>>>
>>>>>>>> * Google
>>>>>>>>
>>>>>>>> ** Ben Chambers
>>>>>>>>
>>>>>>>> ** Dan Halperin
>>>>>>>>
>>>>>>>> ** Davor Bonaci
>>>>>>>>
>>>>>>>> ** Frances Perry
>>>>>>>>
>>>>>>>> ** James Malone
>>>>>>>>
>>>>>>>> ** Kenneth Knowles
>>>>>>>>
>>>>>>>> ** Luke Cwik
>>>>>>>>
>>>>>>>> ** Tyler Akidau
>>>>>>>>
>>>>>>>> * PayPal
>>>>>>>>
>>>>>>>> ** Amit Sela
>>>>>>>>
>>>>>>>> * Slack
>>>>>>>>
>>>>>>>> ** Josh Wills
>>>>>>>>
>>>>>>>> * Talend
>>>>>>>>
>>>>>>>> ** Jean-Baptiste Onofré
>>>>>>>>
>>>>>>>> == Sponsors ==
>>>>>>>>
>>>>>>>> === Champion ===
>>>>>>>>
>>>>>>>> * Jean-Baptiste Onofré      [jbonofre@apache.org]
>>>>>>>>
>>>>>>>> === Nominated Mentors ===
>>>>>>>>
>>>>>>>> * Jim Jagielski           [jim@apache.org]
>>>>>>>>
>>>>>>>> * Venkatesh Seetharam     [venkatesh@apache.org]
>>>>>>>>
>>>>>>>> * Bertrand Delacretaz     [bdelacretaz@apache.org]
>>>>>>>>
>>>>>>>> * Ted Dunning             [tdunning@apache.org]
>>>>>>>>
>>>>>>>> === Sponsoring Entity ===
>>>>>>>>
>>>>>>>> The Apache Incubator
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>> Jean-Baptiste Onofré
>>>>>> jbonofre@apache.org
>>>>>> http://blog.nanthrax.net
>>>>>> Talend - http://www.talend.com
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>>>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> --
>>>> Jean-Baptiste Onofré
>>>> jbonofre@apache.org
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>>
>>>>
>>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>>
>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

