incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jean-Baptiste Onofré ...@nanthrax.net>
Subject Re: [DISCUSS] Apache Dataflow Incubator Proposal
Date Wed, 20 Jan 2016 21:08:54 GMT
Great: you are on the proposal.

Regards
JB

On 01/20/2016 09:00 PM, Joe Witt wrote:
> Hello
>
> This is a very interesting proposal and concept.  I'd like to contribute.
>
> Thanks
> Joe
>
> On Wed, Jan 20, 2016 at 2:50 PM, James Carman
> <james@carmanconsulting.com> wrote:
>> Of course! I'd be happy to help
>>
>> On Wed, Jan 20, 2016 at 2:02 PM Jean-Baptiste Onofré <jb@nanthrax.net>
>> wrote:
>>
>>> Hi James,
>>>
>>> Can I add your to the proposal ?
>>>
>>> Regards
>>> JB
>>>
>>> On 01/20/2016 07:20 PM, James Carman wrote:
>>>> Well, I for one would be very interested in this project and would be
>>> happy
>>>> to contribute.
>>>>
>>>>
>>>> On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré <jb@nanthrax.net>
>>>> wrote:
>>>>
>>>>> Hi Sean,
>>>>>
>>>>> It's a fair point, but not present in most of the proposals. It's
>>>>> something that we can address in the "Community" section.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 01/20/2016 05:55 PM, Sean Busbey wrote:
>>>>>> Great proposal. I like that your proposal includes a well presented
>>>>>> roadmap, but I don't see any goals that directly address building
a
>>>>> larger
>>>>>> community. Y'all have any ideas around outreach that will help with
>>>>>> adoption?
>>>>>>
>>>>>> As a start, I recommend y'all add a section to the proposal on the
wiki
>>>>>> page for "Additional Interested Contributors" so that folks who want
to
>>>>>> sign up to participate in the project can do so without requesting
>>>>>> additions to the initial committer list.
>>>>>>
>>>>>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
>>>>>> jamesmalone@google.com.invalid> wrote:
>>>>>>
>>>>>>> Hello everyone,
>>>>>>>
>>>>>>> Attached to this message is a proposed new project - Apache Dataflow,
>>> a
>>>>>>> unified programming model for data processing and integration.
>>>>>>>
>>>>>>> The text of the proposal is included below. Additionally, the
proposal
>>>>> is
>>>>>>> in draft form on the wiki where we will make any required changes:
>>>>>>>
>>>>>>> https://wiki.apache.org/incubator/DataflowProposal
>>>>>>>
>>>>>>> We look forward to your feedback and input.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> ----
>>>>>>>
>>>>>>> = Apache Dataflow =
>>>>>>>
>>>>>>> == Abstract ==
>>>>>>>
>>>>>>> Dataflow is an open source, unified model and set of language-specific
>>>>> SDKs
>>>>>>> for defining and executing data processing workflows, and also
data
>>>>>>> ingestion and integration flows, supporting Enterprise Integration
>>>>> Patterns
>>>>>>> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
>>> simplify
>>>>>>> the mechanics of large-scale batch and streaming data processing
and
>>> can
>>>>>>> run on a number of runtimes like Apache Flink, Apache Spark,
and
>>> Google
>>>>>>> Cloud Dataflow (a cloud service). Dataflow also brings DSL in
>>> different
>>>>>>> languages, allowing users to easily implement their data integration
>>>>>>> processes.
>>>>>>>
>>>>>>> == Proposal ==
>>>>>>>
>>>>>>> Dataflow is a simple, flexible, and powerful system for distributed
>>> data
>>>>>>> processing at any scale. Dataflow provides a unified programming
>>> model,
>>>>> a
>>>>>>> software development kit to define and construct data processing
>>>>> pipelines,
>>>>>>> and runners to execute Dataflow pipelines in several runtime
engines,
>>>>> like
>>>>>>> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow
can be
>>>>> used
>>>>>>> for a variety of streaming or batch data processing goals including
>>> ETL,
>>>>>>> stream analysis, and aggregate computation. The underlying programming
>>>>>>> model for Dataflow provides MapReduce-like parallelism, combined
with
>>>>>>> support for powerful data windowing, and fine-grained correctness
>>>>> control.
>>>>>>>
>>>>>>> == Background ==
>>>>>>>
>>>>>>> Dataflow started as a set of Google projects focused on making
data
>>>>>>> processing easier, faster, and less costly. The Dataflow model
is a
>>>>>>> successor to MapReduce, FlumeJava, and Millwheel inside Google
and is
>>>>>>> focused on providing a unified solution for batch and stream
>>> processing.
>>>>>>> These projects on which Dataflow is based have been published
in
>>> several
>>>>>>> papers made available to the public:
>>>>>>>
>>>>>>> * MapReduce - http://research.google.com/archive/mapreduce.html
>>>>>>>
>>>>>>> * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>>>>>>>
>>>>>>> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>>>>>>>
>>>>>>> * MillWheel - http://research.google.com/pubs/pub41378.html
>>>>>>>
>>>>>>> Dataflow was designed from the start to provide a portable programming
>>>>>>> layer. When you define a data processing pipeline with the Dataflow
>>>>> model,
>>>>>>> you are creating a job which is capable of being processed by
any
>>>>> number of
>>>>>>> Dataflow processing engines. Several engines have been developed
to
>>> run
>>>>>>> Dataflow pipelines in other open source runtimes, including a
Dataflow
>>>>>>> runner for Apache Flink and Apache Spark. There is also a “direct
>>>>> runner”,
>>>>>>> for execution on the developer machine (mainly for dev/debug
>>> purposes).
>>>>>>> Another runner allows a Dataflow program to run on a managed
service,
>>>>>>> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow
Java SDK
>>>>> is
>>>>>>> already available on GitHub, and independent from the Google
Cloud
>>>>> Dataflow
>>>>>>> service. Another Python SDK is currently in active development.
>>>>>>>
>>>>>>> In this proposal, the Dataflow SDKs, model, and a set of runners
will
>>> be
>>>>>>> submitted as an OSS project under the ASF. The runners which
are a
>>> part
>>>>> of
>>>>>>> this proposal include those for Spark (from Cloudera), Flink
(from
>>> data
>>>>>>> Artisans), and local development (from Google); the Google Cloud
>>>>> Dataflow
>>>>>>> service runner is not included in this proposal. Further references
to
>>>>>>> Dataflow will refer to the Dataflow model, SDKs, and runners
which
>>> are a
>>>>>>> part of this proposal (Apache Dataflow) only. The initial submission
>>>>> will
>>>>>>> contain the already-released Java SDK; Google intends to submit
the
>>>>> Python
>>>>>>> SDK later in the incubation process. The Google Cloud Dataflow
service
>>>>> will
>>>>>>> continue to be one of many runners for Dataflow, built on Google
Cloud
>>>>>>> Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow
will
>>>>>>> develop against the Apache project additions, updates, and changes.
>>>>> Google
>>>>>>> Cloud Dataflow will become one user of Apache Dataflow and will
>>>>> participate
>>>>>>> in the project openly and publicly.
>>>>>>>
>>>>>>> The Dataflow programming model has been designed with simplicity,
>>>>>>> scalability, and speed as key tenants. In the Dataflow model,
you only
>>>>> need
>>>>>>> to think about four top-level concepts when constructing your
data
>>>>>>> processing job:
>>>>>>>
>>>>>>> * Pipelines - The data processing job made of a series of computations
>>>>>>> including input, processing, and output
>>>>>>>
>>>>>>> * PCollections - Bounded (or unbounded) datasets which represent
the
>>>>> input,
>>>>>>> intermediate and output data in pipelines
>>>>>>>
>>>>>>> * PTransforms - A data processing step in a pipeline in which
one or
>>>>> more
>>>>>>> PCollections are an input and output
>>>>>>>
>>>>>>> * I/O Sources and Sinks - APIs for reading and writing data which
are
>>>>> the
>>>>>>> roots and endpoints of the pipeline
>>>>>>>
>>>>>>> == Rationale ==
>>>>>>>
>>>>>>> With Dataflow, Google intended to develop a framework which allowed
>>>>>>> developers to be maximally productive in defining the processing,
and
>>>>> then
>>>>>>> be able to execute the program at various levels of
>>>>>>> latency/cost/completeness without re-architecting or re-writing
it.
>>> This
>>>>>>> goal was informed by Google’s past experience  developing several
>>>>> models,
>>>>>>> frameworks, and tools useful for large-scale and distributed
data
>>>>>>> processing. While Google has previously published papers describing
>>>>> some of
>>>>>>> its technologies, Google decided to take a different approach
with
>>>>>>> Dataflow. Google open-sourced the SDK and model alongside
>>>>> commercialization
>>>>>>> of the idea and ahead of publishing papers on the topic. As a
result,
>>> a
>>>>>>> number of open source runtimes exist for Dataflow, such as the
Apache
>>>>> Flink
>>>>>>> and Apache Spark runners.
>>>>>>>
>>>>>>> We believe that submitting Dataflow as an Apache project will
provide
>>> an
>>>>>>> immediate, worthwhile, and substantial contribution to the open
source
>>>>>>> community. As an incubating project, we believe Dataflow will
have a
>>>>> better
>>>>>>> opportunity to provide a meaningful contribution to OSS and also
>>>>> integrate
>>>>>>> with other Apache projects.
>>>>>>>
>>>>>>> In the long term, we believe Dataflow can be a powerful abstraction
>>>>> layer
>>>>>>> for data processing. By providing an abstraction layer for data
>>>>> pipelines
>>>>>>> and processing, data workflows can be increasingly portable,
resilient
>>>>> to
>>>>>>> breaking changes in tooling, and compatible across many execution
>>>>> engines,
>>>>>>> runtimes, and open source projects.
>>>>>>>
>>>>>>> == Initial Goals ==
>>>>>>>
>>>>>>> We are breaking our initial goals into immediate (< 2 months),
>>>>> short-term
>>>>>>> (2-4 months), and intermediate-term (> 4 months).
>>>>>>>
>>>>>>> Our immediate goals include the following:
>>>>>>>
>>>>>>> * Plan for reconciling the Dataflow Java SDK and various runners
into
>>>>> one
>>>>>>> project
>>>>>>>
>>>>>>> * Plan for refactoring the existing Java SDK for better extensibility
>>> by
>>>>>>> SDK and runner writers
>>>>>>>
>>>>>>> * Validating all dependencies are ASL 2.0 or compatible
>>>>>>>
>>>>>>> * Understanding and adapting to the Apache development process
>>>>>>>
>>>>>>> Our short-term goals include:
>>>>>>>
>>>>>>> * Moving the newly-merged lists, and build utilities to Apache
>>>>>>>
>>>>>>> * Start refactoring codebase and move code to Apache Git repo
>>>>>>>
>>>>>>> * Continue development of new features, functions, and fixes
in the
>>>>>>> Dataflow Java SDK, and Dataflow runners
>>>>>>>
>>>>>>> * Cleaning up the Dataflow SDK sources and crafting a roadmap
and plan
>>>>> for
>>>>>>> how to include new major ideas, modules, and runtimes
>>>>>>>
>>>>>>> * Establishment of easy and clear build/test framework for Dataflow
>>> and
>>>>>>> associated runtimes; creation of testing, rollback, and validation
>>>>> policy
>>>>>>>
>>>>>>> * Analysis and design for work needed to make Dataflow a better
data
>>>>>>> processing abstraction layer for multiple open source frameworks
and
>>>>>>> environments
>>>>>>>
>>>>>>> Finally, we have a number of intermediate-term goals:
>>>>>>>
>>>>>>> * Roadmapping, planning, and execution of integrations with other
OSS
>>>>> and
>>>>>>> non-OSS projects/products
>>>>>>>
>>>>>>> * Inclusion of additional SDK for Python, which is under active
>>>>> development
>>>>>>>
>>>>>>> == Current Status ==
>>>>>>>
>>>>>>> === Meritocracy ===
>>>>>>>
>>>>>>> Dataflow was initially developed based on ideas from many employees
>>>>> within
>>>>>>> Google. As an ASL OSS project on GitHub, the Dataflow SDK has
received
>>>>>>> contributions from data Artisans, Cloudera Labs, and other individual
>>>>>>> developers. As a project under incubation, we are committed to
>>> expanding
>>>>>>> our effort to build an environment which supports a meritocracy.
We
>>> are
>>>>>>> focused on engaging the community and other related projects
for
>>> support
>>>>>>> and contributions. Moreover, we are committed to ensure contributors
>>> and
>>>>>>> committers to Dataflow come from a broad mix of organizations
through
>>> a
>>>>>>> merit-based decision process during incubation. We believe strongly
in
>>>>> the
>>>>>>> Dataflow model and are committed to growing an inclusive community
of
>>>>>>> Dataflow contributors.
>>>>>>>
>>>>>>> === Community ===
>>>>>>>
>>>>>>> The core of the Dataflow Java SDK has been developed by Google
for use
>>>>> with
>>>>>>> Google Cloud Dataflow. Google has active community engagement
in the
>>> SDK
>>>>>>> GitHub repository (
>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK
>>>>>>> ),
>>>>>>> on Stack Overflow (
>>>>>>> http://stackoverflow.com/questions/tagged/google-cloud-dataflow)
and
>>>>> has
>>>>>>> had contributions from a number of organizations and indivuduals.
>>>>>>>
>>>>>>> Everyday, Cloud Dataflow is actively used by a number of organizations
>>>>> and
>>>>>>> institutions for batch and stream processing of data. We believe
>>>>> acceptance
>>>>>>> will allow us to consolidate existing Dataflow-related work,
grow the
>>>>>>> Dataflow community, and deepen connections between Dataflow and
other
>>>>> open
>>>>>>> source projects.
>>>>>>>
>>>>>>> === Core Developers ===
>>>>>>>
>>>>>>> The core developers for Dataflow and the Dataflow runners are:
>>>>>>>
>>>>>>> * Frances Perry
>>>>>>>
>>>>>>> * Tyler Akidau
>>>>>>>
>>>>>>> * Davor Bonaci
>>>>>>>
>>>>>>> * Luke Cwik
>>>>>>>
>>>>>>> * Ben Chambers
>>>>>>>
>>>>>>> * Kenn Knowles
>>>>>>>
>>>>>>> * Dan Halperin
>>>>>>>
>>>>>>> * Daniel Mills
>>>>>>>
>>>>>>> * Mark Shields
>>>>>>>
>>>>>>> * Craig Chambers
>>>>>>>
>>>>>>> * Maximilian Michels
>>>>>>>
>>>>>>> * Tom White
>>>>>>>
>>>>>>> * Josh Wills
>>>>>>>
>>>>>>> === Alignment ===
>>>>>>>
>>>>>>> The Dataflow SDK can be used to create Dataflow pipelines which
can be
>>>>>>> executed on Apache Spark or Apache Flink. Dataflow is also related
to
>>>>> other
>>>>>>> Apache projects, such as Apache Crunch. We plan on expanding
>>>>> functionality
>>>>>>> for Dataflow runners, support for additional domain specific
>>> languages,
>>>>> and
>>>>>>> increased portability so Dataflow is a powerful abstraction layer
for
>>>>> data
>>>>>>> processing.
>>>>>>>
>>>>>>> == Known Risks ==
>>>>>>>
>>>>>>> === Orphaned Products ===
>>>>>>>
>>>>>>> The Dataflow SDK is presently used by several organizations,
from
>>> small
>>>>>>> startups to Fortune 100 companies, to construct production pipelines
>>>>> which
>>>>>>> are executed in Google Cloud Dataflow. Google has a long-term
>>>>> commitment to
>>>>>>> advance the Dataflow SDK; moreover, Dataflow is seeing increasing
>>>>> interest,
>>>>>>> development, and adoption from organizations outside of Google.
>>>>>>>
>>>>>>> === Inexperience with Open Source ===
>>>>>>>
>>>>>>> Google believes strongly in open source and the exchange of
>>> information
>>>>> to
>>>>>>> advance new ideas and work. Examples of this commitment are active
OSS
>>>>>>> projects such as Chromium (https://www.chromium.org) and Kubernetes
(
>>>>>>> http://kubernetes.io/). With Dataflow, we have tried to be
>>> increasingly
>>>>>>> open and forward-looking; we have published a paper in the VLDB
>>>>> conference
>>>>>>> describing the Dataflow model (
>>>>>>> http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick
to
>>>>> release
>>>>>>> the Dataflow SDK as open source software with the launch of Cloud
>>>>> Dataflow.
>>>>>>> Our submission to the Apache Software Foundation is a logical
>>> extension
>>>>> of
>>>>>>> our commitment to open source software.
>>>>>>>
>>>>>>> === Homogeneous Developers ===
>>>>>>>
>>>>>>> The majority of committers in this proposal belong to Google
due to
>>> the
>>>>>>> fact that Dataflow has emerged from several internal Google projects.
>>>>> This
>>>>>>> proposal also includes committers outside of Google who are actively
>>>>>>> involved with other Apache projects, such as Hadoop, Flink, and
Spark.
>>>>> We
>>>>>>> expect our entry into incubation will allow us to expand the
number of
>>>>>>> individuals and organizations participating in Dataflow development.
>>>>>>> Additionally, separation of the Dataflow SDK from Google Cloud
>>> Dataflow
>>>>>>> allows us to focus on the open source SDK and model and do what
is
>>> best
>>>>> for
>>>>>>> this project.
>>>>>>>
>>>>>>> === Reliance on Salaried Developers ===
>>>>>>>
>>>>>>> The Dataflow SDK and Dataflow runners have been developed primarily
by
>>>>>>> salaried developers supporting the Google Cloud Dataflow project.
>>> While
>>>>> the
>>>>>>> Dataflow SDK and Cloud Dataflow have been developed by different
teams
>>>>> (and
>>>>>>> this proposal would reinforce that separation) we expect our
initial
>>>>> set of
>>>>>>> developers will still primarily be salaried. Contribution has
not been
>>>>>>> exclusively from salaried developers, however. For example, the
>>> contrib
>>>>>>> directory of the Dataflow SDK (
>>>>>>>
>>>>>
>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/master/contrib
>>>>>>> )
>>>>>>> contains items from free-time contributors. Moreover, seperate
>>> projects,
>>>>>>> such as ScalaFlow (https://github.com/darkjh/scalaflow) have
been
>>>>> created
>>>>>>> around the Dataflow model and SDK. We expect our reliance on
salaried
>>>>>>> developers will decrease over time during incubation.
>>>>>>>
>>>>>>> === Relationship with other Apache products ===
>>>>>>>
>>>>>>> Dataflow directly interoperates with or utilizes several existing
>>> Apache
>>>>>>> projects.
>>>>>>>
>>>>>>> * Build
>>>>>>>
>>>>>>> ** Apache Maven
>>>>>>>
>>>>>>> * Data I/O, Libraries
>>>>>>>
>>>>>>> ** Apache Avro
>>>>>>>
>>>>>>> ** Apache Commons
>>>>>>>
>>>>>>> * Dataflow runners
>>>>>>>
>>>>>>> ** Apache Flink
>>>>>>>
>>>>>>> ** Apache Spark
>>>>>>>
>>>>>>> Dataflow when used in batch mode shares similarities with Apache
>>> Crunch;
>>>>>>> however, Dataflow is focused on a model, SDK, and abstraction
layer
>>>>> beyond
>>>>>>> Spark and Hadoop (MapReduce.) One key goal of Dataflow is to
provide
>>> an
>>>>>>> intermediate abstraction layer which can easily be implemented
and
>>>>> utilized
>>>>>>> across several different processing frameworks.
>>>>>>>
>>>>>>> === An excessive fascination with the Apache brand ===
>>>>>>>
>>>>>>> With this proposal we are not seeking attention or publicity.
Rather,
>>> we
>>>>>>> firmly believe in the Dataflow model, SDK, and the ability to
make
>>>>> Dataflow
>>>>>>> a powerful yet simple framework for data processing. While the
>>> Dataflow
>>>>> SDK
>>>>>>> and model have been open source, we believe putting code on GitHub
can
>>>>> only
>>>>>>> go so far. We see the Apache community, processes, and mission
as
>>>>> critical
>>>>>>> for ensuring the Dataflow SDK and model are truly community-driven,
>>>>>>> positively impactful, and innovative open source software. While
>>> Google
>>>>> has
>>>>>>> taken a number of steps to advance its various open source projects,
>>> we
>>>>>>> believe Dataflow is a great fit for the Apache Software Foundation
due
>>>>> to
>>>>>>> its focus on data processing and its relationships to existing
ASF
>>>>>>> projects.
>>>>>>>
>>>>>>> == Documentation ==
>>>>>>>
>>>>>>> The following documentation is relevant to this proposal. Relevant
>>>>> portion
>>>>>>> of the documentation will be contributed to the Apache Dataflow
>>> project.
>>>>>>>
>>>>>>> * Dataflow website: https://cloud.google.com/dataflow
>>>>>>>
>>>>>>> * Dataflow programming model:
>>>>>>> https://cloud.google.com/dataflow/model/programming-model
>>>>>>>
>>>>>>> * Codebases
>>>>>>>
>>>>>>> ** Dataflow Java SDK:
>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK
>>>>>>>
>>>>>>> ** Flink Dataflow runner:
>>>>> https://github.com/dataArtisans/flink-dataflow
>>>>>>>
>>>>>>> ** Spark Dataflow runner: https://github.com/cloudera/spark-dataflow
>>>>>>>
>>>>>>> * Dataflow Java SDK issue tracker:
>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues
>>>>>>>
>>>>>>> * google-cloud-dataflow tag on Stack Overflow:
>>>>>>> http://stackoverflow.com/questions/tagged/google-cloud-dataflow
>>>>>>>
>>>>>>> == Initial Source ==
>>>>>>>
>>>>>>> The initial source for Dataflow which we will submit to the Apache
>>>>>>> Foundation will include several related projects which are currently
>>>>> hosted
>>>>>>> on the GitHub repositories:
>>>>>>>
>>>>>>> * Dataflow Java SDK (
>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
>>>>>>>
>>>>>>> * Flink Dataflow runner (
>>> https://github.com/dataArtisans/flink-dataflow
>>>>> )
>>>>>>>
>>>>>>> * Spark Dataflow runner (https://github.com/cloudera/spark-dataflow)
>>>>>>>
>>>>>>> These projects have always been Apache 2.0 licensed. We intend
to
>>> bundle
>>>>>>> all of these repositories since they are all complimentary and
should
>>> be
>>>>>>> maintained in one project. Prior to our submission, we will combine
>>> all
>>>>> of
>>>>>>> these projects into a new git repository.
>>>>>>>
>>>>>>> == Source and Intellectual Property Submission Plan ==
>>>>>>>
>>>>>>> The source for the Dataflow SDK and the three runners (Spark,
Flink,
>>>>> Google
>>>>>>> Cloud Dataflow) are already licensed under an Apache 2 license.
>>>>>>>
>>>>>>> * Dataflow SDK -
>>>>>>>
>>>>>
>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENSE
>>>>>>>
>>>>>>> * Flink runner -
>>>>>>> https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE
>>>>>>>
>>>>>>> * Spark runner -
>>>>>>> https://github.com/cloudera/spark-dataflow/blob/master/LICENSE
>>>>>>>
>>>>>>> Contributors to the Dataflow SDK have also signed the Google
>>> Individual
>>>>>>> Contributor License Agreement (
>>>>>>> https://cla.developers.google.com/about/google-individual) in
order
>>> to
>>>>>>> contribute to the project.
>>>>>>>
>>>>>>> With respect to trademark rights, Google does not hold a trademark
on
>>>>> the
>>>>>>> phrase “Dataflow.” Based on feedback and guidance we receive
during
>>> the
>>>>>>> incubation process, we are open to renaming the project if necessary
>>> for
>>>>>>> trademark or other concerns.
>>>>>>>
>>>>>>> == External Dependencies ==
>>>>>>>
>>>>>>> All external dependencies are licensed under an Apache 2.0 or
>>>>>>> Apache-compatible license. As we grow the Dataflow community
we will
>>>>>>> configure our build process to require and validate all contributions
>>>>> and
>>>>>>> dependencies are licensed under the Apache 2.0 license or are
under an
>>>>>>> Apache-compatible license.
>>>>>>>
>>>>>>> == Required Resources ==
>>>>>>>
>>>>>>> === Mailing Lists ===
>>>>>>>
>>>>>>> We currently use a mix of mailing lists. We will migrate our
existing
>>>>>>> mailing lists to the following:
>>>>>>>
>>>>>>> * dev@dataflow.incubator.apache.org
>>>>>>>
>>>>>>> * user@dataflow.incubator.apache.org
>>>>>>>
>>>>>>> * private@dataflow.incubator.apache.org
>>>>>>>
>>>>>>> * commits@dataflow.incubator.apache.org
>>>>>>>
>>>>>>> === Source Control ===
>>>>>>>
>>>>>>> The Dataflow team currently uses Git and would like to continue
to do
>>>>> so.
>>>>>>> We request a Git repository for Dataflow with mirroring to GitHub
>>>>> enabled.
>>>>>>>
>>>>>>> === Issue Tracking ===
>>>>>>>
>>>>>>> We request the creation of an Apache-hosted JIRA. The Dataflow
project
>>>>> is
>>>>>>> currently using both a public GitHub issue tracker and internal
Google
>>>>>>> issue tracking. We will migrate and combine from these two sources
to
>>>>> the
>>>>>>> Apache JIRA.
>>>>>>>
>>>>>>> == Initial Committers ==
>>>>>>>
>>>>>>> * Aljoscha Krettek     [aljoscha@apache.org]
>>>>>>>
>>>>>>> * Amit Sela            [amitsela33@gmail.com]
>>>>>>>
>>>>>>> * Ben Chambers         [bchambers@google.com]
>>>>>>>
>>>>>>> * Craig Chambers       [chambers@google.com]
>>>>>>>
>>>>>>> * Dan Halperin         [dhalperi@google.com]
>>>>>>>
>>>>>>> * Davor Bonaci         [davor@google.com]
>>>>>>>
>>>>>>> * Frances Perry        [fjp@google.com]
>>>>>>>
>>>>>>> * James Malone         [jamesmalone@google.com]
>>>>>>>
>>>>>>> * Jean-Baptiste Onofré [jbonofre@apache.org]
>>>>>>>
>>>>>>> * Josh Wills           [jwills@apache.org]
>>>>>>>
>>>>>>> * Kostas Tzoumas       [kostas@data-artisans.com]
>>>>>>>
>>>>>>> * Kenneth Knowles      [klk@google.com]
>>>>>>>
>>>>>>> * Luke Cwik            [lcwik@google.com]
>>>>>>>
>>>>>>> * Maximilian Michels   [mxm@apache.org]
>>>>>>>
>>>>>>> * Stephan Ewen         [stephan@data-artisans.com]
>>>>>>>
>>>>>>> * Tom White            [tom@cloudera.com]
>>>>>>>
>>>>>>> * Tyler Akidau         [takidau@google.com]
>>>>>>>
>>>>>>> == Affiliations ==
>>>>>>>
>>>>>>> The initial committers are from six organizations. Google developed
>>>>>>> Dataflow and the Dataflow SDK, data Artisans developed the Flink
>>> runner,
>>>>>>> and Cloudera (Labs) developed the Spark runner.
>>>>>>>
>>>>>>> * Cloudera
>>>>>>>
>>>>>>> ** Tom White
>>>>>>>
>>>>>>> * Data Artisans
>>>>>>>
>>>>>>> ** Aljoscha Krettek
>>>>>>>
>>>>>>> ** Kostas Tzoumas
>>>>>>>
>>>>>>> ** Maximilian Michels
>>>>>>>
>>>>>>> ** Stephan Ewen
>>>>>>>
>>>>>>> * Google
>>>>>>>
>>>>>>> ** Ben Chambers
>>>>>>>
>>>>>>> ** Dan Halperin
>>>>>>>
>>>>>>> ** Davor Bonaci
>>>>>>>
>>>>>>> ** Frances Perry
>>>>>>>
>>>>>>> ** James Malone
>>>>>>>
>>>>>>> ** Kenneth Knowles
>>>>>>>
>>>>>>> ** Luke Cwik
>>>>>>>
>>>>>>> ** Tyler Akidau
>>>>>>>
>>>>>>> * PayPal
>>>>>>>
>>>>>>> ** Amit Sela
>>>>>>>
>>>>>>> * Slack
>>>>>>>
>>>>>>> ** Josh Wills
>>>>>>>
>>>>>>> * Talend
>>>>>>>
>>>>>>> ** Jean-Baptiste Onofré
>>>>>>>
>>>>>>> == Sponsors ==
>>>>>>>
>>>>>>> === Champion ===
>>>>>>>
>>>>>>> * Jean-Baptiste Onofre      [jbonofre@apache.org]
>>>>>>>
>>>>>>> === Nominated Mentors ===
>>>>>>>
>>>>>>> * Jim Jagielski           [jim@apache.org]
>>>>>>>
>>>>>>> * Venkatesh Seetharam     [venkatesh@apache.org]
>>>>>>>
>>>>>>> * Bertrand Delacretaz     [bdelacretaz@apache.org]
>>>>>>>
>>>>>>> * Ted Dunning             [tdunning@apache.org]
>>>>>>>
>>>>>>> === Sponsoring Entity ===
>>>>>>>
>>>>>>> The Apache Incubator
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> jbonofre@apache.org
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Mime
View raw message