incubator-general mailing list archives

From Ashish <paliwalash...@gmail.com>
Subject Re: [DISCUSS] Apache Dataflow Incubator Proposal
Date Fri, 22 Jan 2016 18:42:39 GMT
Crunch has Spark pipelines, but I'm not sure about the runner abstraction.

Maybe Josh Wills or Tom White can provide more insight on this topic.
They are core devs for both projects :)
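
To make the "runner abstraction" point concrete, here is a minimal, self-contained Python sketch of the idea: the pipeline definition is kept independent of the engine that executes it. All class and method names here are invented for illustration; this is not the actual Dataflow or Crunch API.

```python
# Illustrative sketch of a runner abstraction (invented names): the same
# pipeline definition can be handed to different execution engines.

class Runner:
    """Base class: knows how to execute a pipeline's transforms."""
    def execute(self, transforms, data):
        raise NotImplementedError

class DirectRunner(Runner):
    """Runs everything locally, in order (analogous to a 'direct runner')."""
    def execute(self, transforms, data):
        for fn in transforms:
            data = [fn(x) for x in data]
        return data

class Pipeline:
    def __init__(self):
        self.transforms = []
    def apply(self, fn):
        self.transforms.append(fn)
        return self
    def run(self, runner, data):
        # The pipeline delegates execution: swapping in a Spark- or
        # Flink-backed runner would not change the pipeline code itself.
        return runner.execute(self.transforms, data)

p = Pipeline().apply(lambda x: x + 1).apply(lambda x: x * 2)
print(p.run(DirectRunner(), [1, 2, 3]))  # [4, 6, 8]
```

A system that only builds MapReduce pipelines (as described for Crunch above) bakes the engine into the translation step, rather than exposing a pluggable `Runner` seam like this.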

On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
> Hi,
>
> I don't know Crunch deeply, but AFAIK Crunch creates MapReduce pipelines; it
> doesn't provide a runner abstraction. It's based on FlumeJava.
>
> The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm
> wrong, but Crunch started after Google Dataflow, especially because Dataflow
> was not open-sourced at that time.
>
> So, I agree it's very similar/close.
>
> Regards
> JB
>
>
> On 01/22/2016 05:51 PM, Ashish wrote:
>>
>> Hi JB,
>>
>> Curious to know how it compares to Apache Crunch? The constructs look
>> very familiar (I had used Crunch long ago)
>>
>> Thoughts?
>>
>> - Ashish
>>
>> On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré <jb@nanthrax.net>
>> wrote:
>>>
>>> Hi Seshu,
>>>
>>> I blogged about Apache Dataflow proposal:
>>> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
>>>
>>> You can see in the "what's next?" section that new runners, skins, and
>>> sources are on our roadmap. Definitely, a Storm runner could be part of
>>> this.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:
>>>>
>>>>
>>>> Awesome to see Cloud Dataflow coming to Apache. The stream processing
>>>> area has in general been fragmented with a variety of solutions; hoping
>>>> the community galvanizes around Apache Dataflow.
>>>>
>>>> We are still in the "Apache Storm" world. Any chance of folks building a
>>>> "Storm Runner"?
>>>>
>>>>
>>>> On 1/20/16, 9:39 AM, "James Malone" <jamesmalone@google.com.INVALID>
>>>> wrote:
>>>>
>>>>>> Great proposal. I like that your proposal includes a well presented
>>>>>> roadmap, but I don't see any goals that directly address building a
>>>>>> larger
>>>>>> community. Y'all have any ideas around outreach that will help with
>>>>>> adoption?
>>>>>>
>>>>>
>>>>> Thank you and fair point. We have a few additional ideas which we can
>>>>> put into the Community section.
>>>>>
>>>>>
>>>>>>
>>>>>> As a start, I recommend y'all add a section to the proposal on the wiki
>>>>>> page for "Additional Interested Contributors" so that folks who want to
>>>>>> sign up to participate in the project can do so without requesting
>>>>>> additions to the initial committer list.
>>>>>>
>>>>>>
>>>>> This is a great idea and I think it makes a lot of sense to add an
>>>>> "Additional Interested Contributors" section to the proposal.
>>>>>
>>>>>
>>>>>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
>>>>>> jamesmalone@google.com.invalid> wrote:
>>>>>>
>>>>>>> Hello everyone,
>>>>>>>
>>>>>>> Attached to this message is a proposed new project - Apache Dataflow, a
>>>>>>> unified programming model for data processing and integration.
>>>>>>>
>>>>>>> The text of the proposal is included below. Additionally, the proposal
>>>>>>> is in draft form on the wiki where we will make any required changes:
>>>>>>>
>>>>>>> https://wiki.apache.org/incubator/DataflowProposal
>>>>>>>
>>>>>>> We look forward to your feedback and input.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> ----
>>>>>>>
>>>>>>> = Apache Dataflow =
>>>>>>>
>>>>>>> == Abstract ==
>>>>>>>
>>>>>>> Dataflow is an open source, unified model and set of language-specific
>>>>>>> SDKs for defining and executing data processing workflows, as well as
>>>>>>> data ingestion and integration flows, supporting Enterprise Integration
>>>>>>> Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
>>>>>>> simplify the mechanics of large-scale batch and streaming data
>>>>>>> processing and can run on a number of runtimes such as Apache Flink,
>>>>>>> Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also
>>>>>>> provides DSLs in different languages, allowing users to easily
>>>>>>> implement their data integration processes.
>>>>>>>
>>>>>>> == Proposal ==
>>>>>>>
>>>>>>> Dataflow is a simple, flexible, and powerful system for distributed
>>>>>>> data processing at any scale. Dataflow provides a unified programming
>>>>>>> model, a software development kit to define and construct data
>>>>>>> processing pipelines, and runners to execute Dataflow pipelines on
>>>>>>> several runtime engines, such as Apache Spark, Apache Flink, or Google
>>>>>>> Cloud Dataflow. Dataflow can be used for a variety of streaming or
>>>>>>> batch data processing goals including ETL, stream analysis, and
>>>>>>> aggregate computation. The underlying programming model for Dataflow
>>>>>>> provides MapReduce-like parallelism, combined with support for powerful
>>>>>>> data windowing and fine-grained correctness control.
>>>>>>>
>>>>>>> == Background ==
>>>>>>>
>>>>>>> Dataflow started as a set of Google projects focused on making data
>>>>>>> processing easier, faster, and less costly. The Dataflow model is a
>>>>>>> successor to MapReduce, FlumeJava, and MillWheel inside Google and is
>>>>>>> focused on providing a unified solution for batch and stream
>>>>>>> processing. These projects on which Dataflow is based have been
>>>>>>> published in several papers made available to the public:
>>>>>>>
>>>>>>> * MapReduce - http://research.google.com/archive/mapreduce.html
>>>>>>>
>>>>>>> * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>>>>>>>
>>>>>>> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>>>>>>>
>>>>>>> * MillWheel - http://research.google.com/pubs/pub41378.html
>>>>>>>
>>>>>>> Dataflow was designed from the start to provide a portable programming
>>>>>>> layer. When you define a data processing pipeline with the Dataflow
>>>>>>> model, you are creating a job which is capable of being processed by
>>>>>>> any number of Dataflow processing engines. Several engines have been
>>>>>>> developed to run Dataflow pipelines in other open source runtimes,
>>>>>>> including Dataflow runners for Apache Flink and Apache Spark. There is
>>>>>>> also a "direct runner" for execution on the developer machine (mainly
>>>>>>> for dev/debug purposes). Another runner allows a Dataflow program to
>>>>>>> run on a managed service, Google Cloud Dataflow, in Google Cloud
>>>>>>> Platform. The Dataflow Java SDK is already available on GitHub,
>>>>>>> independent from the Google Cloud Dataflow service. A Python SDK is
>>>>>>> currently in active development.
>>>>>>>
>>>>>>> In this proposal, the Dataflow SDKs, model, and a set of runners will
>>>>>>> be submitted as an OSS project under the ASF. The runners which are a
>>>>>>> part of this proposal include those for Spark (from Cloudera), Flink
>>>>>>> (from data Artisans), and local development (from Google); the Google
>>>>>>> Cloud Dataflow service runner is not included in this proposal. Further
>>>>>>> references to Dataflow will refer only to the Dataflow model, SDKs, and
>>>>>>> runners which are a part of this proposal (Apache Dataflow). The
>>>>>>> initial submission will contain the already-released Java SDK; Google
>>>>>>> intends to submit the Python SDK later in the incubation process. The
>>>>>>> Google Cloud Dataflow service will continue to be one of many runners
>>>>>>> for Dataflow, built on Google Cloud Platform, to run Dataflow
>>>>>>> pipelines. Necessarily, Cloud Dataflow will develop against the Apache
>>>>>>> project's additions, updates, and changes. Google Cloud Dataflow will
>>>>>>> become one user of Apache Dataflow and will participate in the project
>>>>>>> openly and publicly.
>>>>>>>
>>>>>>> The Dataflow programming model has been designed with simplicity,
>>>>>>> scalability, and speed as key tenets. In the Dataflow model, you only
>>>>>>> need to think about four top-level concepts when constructing your data
>>>>>>> processing job:
>>>>>>>
>>>>>>> * Pipelines - The data processing job, made of a series of computations
>>>>>>> including input, processing, and output
>>>>>>>
>>>>>>> * PCollections - Bounded (or unbounded) datasets which represent the
>>>>>>> input, intermediate, and output data in pipelines
>>>>>>>
>>>>>>> * PTransforms - A data processing step in a pipeline, in which one or
>>>>>>> more PCollections are an input and output
>>>>>>>
>>>>>>> * I/O Sources and Sinks - APIs for reading and writing data, which are
>>>>>>> the roots and endpoints of the pipeline
>>>>>>>
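
The four concepts above can be modeled in a few lines of code. The following is a minimal, self-contained Python sketch used purely as an analogy; the class and method names are invented for illustration and do not match the real Dataflow Java SDK API.

```python
# Minimal model of the four Dataflow concepts (illustrative only):
# Pipeline, PCollection, PTransform, and I/O source/sink.

class PCollection:
    """A (bounded) dataset flowing through the pipeline."""
    def __init__(self, elements):
        self.elements = list(elements)

class PTransform:
    """A processing step: consumes a PCollection, produces a PCollection."""
    def __init__(self, fn):
        self.fn = fn
    def expand(self, pcoll):
        return PCollection(self.fn(e) for e in pcoll.elements)

class Pipeline:
    """The job: an I/O source, a series of transforms, and an I/O sink."""
    def __init__(self, source):
        self.source = source          # I/O source: root of the pipeline
        self.transforms = []
    def apply(self, transform):
        self.transforms.append(transform)
        return self
    def run(self, sink):
        pcoll = PCollection(self.source)
        for t in self.transforms:
            pcoll = t.expand(pcoll)
        sink.extend(pcoll.elements)   # I/O sink: endpoint of the pipeline
        return sink

output = []
Pipeline(source=[1, 2, 3]).apply(PTransform(lambda x: x * 10)).run(output)
print(output)  # [10, 20, 30]
```

In the real SDK the same shape appears as a chain of `apply` calls on a `Pipeline` object, with sources and sinks at the ends.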
>>>>>>> == Rationale ==
>>>>>>>
>>>>>>> With Dataflow, Google intended to develop a framework which allowed
>>>>>>> developers to be maximally productive in defining the processing, and
>>>>>>> then be able to execute the program at various levels of
>>>>>>> latency/cost/completeness without re-architecting or re-writing it.
>>>>>>> This goal was informed by Google's past experience developing several
>>>>>>> models, frameworks, and tools useful for large-scale and distributed
>>>>>>> data processing. While Google has previously published papers
>>>>>>> describing some of its technologies, Google decided to take a different
>>>>>>> approach with Dataflow. Google open-sourced the SDK and model alongside
>>>>>>> commercialization of the idea and ahead of publishing papers on the
>>>>>>> topic. As a result, a number of open source runtimes exist for
>>>>>>> Dataflow, such as the Apache Flink and Apache Spark runners.
>>>>>>>
>>>>>>> We believe that submitting Dataflow as an Apache project will provide
>>>>>>> an immediate, worthwhile, and substantial contribution to the open
>>>>>>> source community. As an incubating project, we believe Dataflow will
>>>>>>> have a better opportunity to provide a meaningful contribution to OSS
>>>>>>> and also integrate with other Apache projects.
>>>>>>>
>>>>>>> In the long term, we believe Dataflow can be a powerful abstraction
>>>>>>> layer for data processing. By providing an abstraction layer for data
>>>>>>> pipelines and processing, data workflows can be increasingly portable,
>>>>>>> resilient to breaking changes in tooling, and compatible across many
>>>>>>> execution engines, runtimes, and open source projects.
>>>>>>>
>>>>>>> == Initial Goals ==
>>>>>>>
>>>>>>> We are breaking our initial goals into immediate (< 2 months),
>>>>>>> short-term (2-4 months), and intermediate-term (> 4 months).
>>>>>>>
>>>>>>> Our immediate goals include the following:
>>>>>>>
>>>>>>> * Plan for reconciling the Dataflow Java SDK and various runners into
>>>>>>> one project
>>>>>>>
>>>>>>> * Plan for refactoring the existing Java SDK for better extensibility
>>>>>>> by SDK and runner writers
>>>>>>>
>>>>>>> * Validating that all dependencies are ASL 2.0 or compatible
>>>>>>>
>>>>>>> * Understanding and adapting to the Apache development process
>>>>>>>
>>>>>>> Our short-term goals include:
>>>>>>>
>>>>>>> * Moving the newly-merged lists and build utilities to Apache
>>>>>>>
>>>>>>> * Starting to refactor the codebase and moving code to the Apache Git
>>>>>>> repo
>>>>>>>
>>>>>>> * Continuing development of new features, functions, and fixes in the
>>>>>>> Dataflow Java SDK and Dataflow runners
>>>>>>>
>>>>>>> * Cleaning up the Dataflow SDK sources and crafting a roadmap and plan
>>>>>>> for how to include new major ideas, modules, and runtimes
>>>>>>>
>>>>>>> * Establishing an easy and clear build/test framework for Dataflow and
>>>>>>> associated runtimes; creating a testing, rollback, and validation
>>>>>>> policy
>>>>>>>
>>>>>>> * Analysis and design for work needed to make Dataflow a better data
>>>>>>> processing abstraction layer for multiple open source frameworks and
>>>>>>> environments
>>>>>>>
>>>>>>> Finally, we have a number of intermediate-term goals:
>>>>>>>
>>>>>>> * Roadmapping, planning, and execution of integrations with other OSS
>>>>>>> and non-OSS projects/products
>>>>>>>
>>>>>>> * Inclusion of an additional SDK for Python, which is under active
>>>>>>> development
>>>>>>>
>>>>>>> == Current Status ==
>>>>>>>
>>>>>>> === Meritocracy ===
>>>>>>>
>>>>>>> Dataflow was initially developed based on ideas from many employees
>>>>>>> within Google. As an ASL OSS project on GitHub, the Dataflow SDK has
>>>>>>> received contributions from data Artisans, Cloudera Labs, and other
>>>>>>> individual developers. As a project under incubation, we are committed
>>>>>>> to expanding our effort to build an environment which supports a
>>>>>>> meritocracy. We are focused on engaging the community and other related
>>>>>>> projects for support and contributions. Moreover, we are committed to
>>>>>>> ensuring that contributors and committers to Dataflow come from a broad
>>>>>>> mix of organizations through a merit-based decision process during
>>>>>>> incubation. We believe strongly in the Dataflow model and are committed
>>>>>>> to growing an inclusive community of Dataflow contributors.
>>>>>>>
>>>>>>> === Community ===
>>>>>>>
>>>>>>> The core of the Dataflow Java SDK has been developed by Google for use
>>>>>>> with Google Cloud Dataflow. Google has active community engagement in
>>>>>>> the SDK GitHub repository
>>>>>>> (https://github.com/GoogleCloudPlatform/DataflowJavaSDK) and on Stack
>>>>>>> Overflow
>>>>>>> (http://stackoverflow.com/questions/tagged/google-cloud-dataflow), and
>>>>>>> has had contributions from a number of organizations and individuals.
>>>>>>>
>>>>>>> Every day, Cloud Dataflow is actively used by a number of organizations
>>>>>>> and institutions for batch and stream processing of data. We believe
>>>>>>> acceptance will allow us to consolidate existing Dataflow-related work,
>>>>>>> grow the Dataflow community, and deepen connections between Dataflow
>>>>>>> and other open source projects.
>>>>>>>
>>>>>>> === Core Developers ===
>>>>>>>
>>>>>>> The core developers for Dataflow and the Dataflow runners are:
>>>>>>>
>>>>>>> * Frances Perry
>>>>>>>
>>>>>>> * Tyler Akidau
>>>>>>>
>>>>>>> * Davor Bonaci
>>>>>>>
>>>>>>> * Luke Cwik
>>>>>>>
>>>>>>> * Ben Chambers
>>>>>>>
>>>>>>> * Kenn Knowles
>>>>>>>
>>>>>>> * Dan Halperin
>>>>>>>
>>>>>>> * Daniel Mills
>>>>>>>
>>>>>>> * Mark Shields
>>>>>>>
>>>>>>> * Craig Chambers
>>>>>>>
>>>>>>> * Maximilian Michels
>>>>>>>
>>>>>>> * Tom White
>>>>>>>
>>>>>>> * Josh Wills
>>>>>>>
>>>>>>> === Alignment ===
>>>>>>>
>>>>>>> The Dataflow SDK can be used to create Dataflow pipelines which can be
>>>>>>> executed on Apache Spark or Apache Flink. Dataflow is also related to
>>>>>>> other Apache projects, such as Apache Crunch. We plan on expanding
>>>>>>> functionality for Dataflow runners, support for additional domain
>>>>>>> specific languages, and increased portability so Dataflow is a powerful
>>>>>>> abstraction layer for data processing.
>>>>>>>
>>>>>>> == Known Risks ==
>>>>>>>
>>>>>>> === Orphaned Products ===
>>>>>>>
>>>>>>> The Dataflow SDK is presently used by several organizations, from small
>>>>>>> startups to Fortune 100 companies, to construct production pipelines
>>>>>>> which are executed in Google Cloud Dataflow. Google has a long-term
>>>>>>> commitment to advance the Dataflow SDK; moreover, Dataflow is seeing
>>>>>>> increasing interest, development, and adoption from organizations
>>>>>>> outside of Google.
>>>>>>>
>>>>>>> === Inexperience with Open Source ===
>>>>>>>
>>>>>>> Google believes strongly in open source and the exchange of information
>>>>>>> to advance new ideas and work. Examples of this commitment are active
>>>>>>> OSS projects such as Chromium (https://www.chromium.org) and Kubernetes
>>>>>>> (http://kubernetes.io/). With Dataflow, we have tried to be
>>>>>>> increasingly open and forward-looking; we have published a paper at the
>>>>>>> VLDB conference describing the Dataflow model
>>>>>>> (http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick to
>>>>>>> release the Dataflow SDK as open source software with the launch of
>>>>>>> Cloud Dataflow.
>>>>>>>
>>>>>>> Our submission to the Apache Software Foundation is a logical extension
>>>>>>> of our commitment to open source software.
>>>>>>>
>>>>>>> === Homogeneous Developers ===
>>>>>>>
>>>>>>> The majority of committers in this proposal belong to Google, due to
>>>>>>> the fact that Dataflow has emerged from several internal Google
>>>>>>> projects. This proposal also includes committers outside of Google who
>>>>>>> are actively involved with other Apache projects, such as Hadoop,
>>>>>>> Flink, and Spark. We expect our entry into incubation will allow us to
>>>>>>> expand the number of individuals and organizations participating in
>>>>>>> Dataflow development. Additionally, separation of the Dataflow SDK from
>>>>>>> Google Cloud Dataflow allows us to focus on the open source SDK and
>>>>>>> model and do what is best for this project.
>>>>>>>
>>>>>>> === Reliance on Salaried Developers ===
>>>>>>>
>>>>>>> The Dataflow SDK and Dataflow runners have been developed primarily by
>>>>>>> salaried developers supporting the Google Cloud Dataflow project. While
>>>>>>> the Dataflow SDK and Cloud Dataflow have been developed by different
>>>>>>> teams (and this proposal would reinforce that separation), we expect
>>>>>>> our initial set of developers will still primarily be salaried.
>>>>>>> Contribution has not been exclusively from salaried developers,
>>>>>>> however. For example, the contrib directory of the Dataflow SDK
>>>>>>> (https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/master/contrib)
>>>>>>> contains items from free-time contributors. Moreover, separate
>>>>>>> projects, such as ScalaFlow (https://github.com/darkjh/scalaflow), have
>>>>>>> been created around the Dataflow model and SDK. We expect our reliance
>>>>>>> on salaried developers will decrease over time during incubation.
>>>>>>>
>>>>>>> === Relationship with other Apache products ===
>>>>>>>
>>>>>>> Dataflow directly interoperates with or utilizes several existing
>>>>>>> Apache projects.
>>>>>>>
>>>>>>> * Build
>>>>>>>
>>>>>>> ** Apache Maven
>>>>>>>
>>>>>>> * Data I/O, Libraries
>>>>>>>
>>>>>>> ** Apache Avro
>>>>>>>
>>>>>>> ** Apache Commons
>>>>>>>
>>>>>>> * Dataflow runners
>>>>>>>
>>>>>>> ** Apache Flink
>>>>>>>
>>>>>>> ** Apache Spark
>>>>>>>
>>>>>>> Dataflow, when used in batch mode, shares similarities with Apache
>>>>>>> Crunch; however, Dataflow is focused on a model, SDK, and abstraction
>>>>>>> layer beyond Spark and Hadoop (MapReduce). One key goal of Dataflow is
>>>>>>> to provide an intermediate abstraction layer which can easily be
>>>>>>> implemented and utilized across several different processing
>>>>>>> frameworks.
>>>>>>>
>>>>>>> === An excessive fascination with the Apache brand ===
>>>>>>>
>>>>>>> With this proposal we are not seeking attention or publicity. Rather,
>>>>>>> we firmly believe in the Dataflow model, SDK, and the ability to make
>>>>>>> Dataflow a powerful yet simple framework for data processing. While the
>>>>>>> Dataflow SDK and model have been open source, we believe putting code
>>>>>>> on GitHub can only go so far. We see the Apache community, processes,
>>>>>>> and mission as critical for ensuring the Dataflow SDK and model are
>>>>>>> truly community-driven, positively impactful, and innovative open
>>>>>>> source software. While Google has taken a number of steps to advance
>>>>>>> its various open source projects, we believe Dataflow is a great fit
>>>>>>> for the Apache Software Foundation due to its focus on data processing
>>>>>>> and its relationships to existing ASF projects.
>>>>>>>
>>>>>>> == Documentation ==
>>>>>>>
>>>>>>> The following documentation is relevant to this proposal. Relevant
>>>>>>> portions of the documentation will be contributed to the Apache
>>>>>>> Dataflow project.
>>>>>>>
>>>>>>> * Dataflow website: https://cloud.google.com/dataflow
>>>>>>>
>>>>>>> * Dataflow programming model:
>>>>>>> https://cloud.google.com/dataflow/model/programming-model
>>>>>>>
>>>>>>> * Codebases
>>>>>>>
>>>>>>> ** Dataflow Java SDK:
>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK
>>>>>>>
>>>>>>> ** Flink Dataflow runner: https://github.com/dataArtisans/flink-dataflow
>>>>>>>
>>>>>>> ** Spark Dataflow runner: https://github.com/cloudera/spark-dataflow
>>>>>>>
>>>>>>> * Dataflow Java SDK issue tracker:
>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues
>>>>>>>
>>>>>>> * google-cloud-dataflow tag on Stack Overflow:
>>>>>>> http://stackoverflow.com/questions/tagged/google-cloud-dataflow
>>>>>>>
>>>>>>> == Initial Source ==
>>>>>>>
>>>>>>> The initial source for Dataflow which we will submit to the Apache
>>>>>>> Software Foundation will include several related projects which are
>>>>>>> currently hosted in GitHub repositories:
>>>>>>>
>>>>>>> * Dataflow Java SDK
>>>>>>> (https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
>>>>>>>
>>>>>>> * Flink Dataflow runner (https://github.com/dataArtisans/flink-dataflow)
>>>>>>>
>>>>>>> * Spark Dataflow runner (https://github.com/cloudera/spark-dataflow)
>>>>>>>
>>>>>>> These projects have always been Apache 2.0 licensed. We intend to
>>>>>>> bundle all of these repositories since they are all complementary and
>>>>>>> should be maintained in one project. Prior to our submission, we will
>>>>>>> combine all of these projects into a new Git repository.
>>>>>>>
>>>>>>> == Source and Intellectual Property Submission Plan ==
>>>>>>>
>>>>>>> The source for the Dataflow SDK and the three runners (Spark, Flink,
>>>>>>> Google Cloud Dataflow) is already licensed under the Apache 2.0
>>>>>>> license.
>>>>>>>
>>>>>>> * Dataflow SDK -
>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENSE
>>>>>>>
>>>>>>> * Flink runner -
>>>>>>> https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE
>>>>>>>
>>>>>>> * Spark runner -
>>>>>>> https://github.com/cloudera/spark-dataflow/blob/master/LICENSE
>>>>>>>
>>>>>>> Contributors to the Dataflow SDK have also signed the Google Individual
>>>>>>> Contributor License Agreement
>>>>>>> (https://cla.developers.google.com/about/google-individual) in order to
>>>>>>> contribute to the project.
>>>>>>>
>>>>>>> With respect to trademark rights, Google does not hold a trademark on
>>>>>>> the phrase "Dataflow." Based on feedback and guidance we receive during
>>>>>>> the incubation process, we are open to renaming the project if
>>>>>>> necessary for trademark or other concerns.
>>>>>>>
>>>>>>> == External Dependencies ==
>>>>>>>
>>>>>>> All external dependencies are licensed under the Apache 2.0 license or
>>>>>>> an Apache-compatible license. As we grow the Dataflow community, we
>>>>>>> will configure our build process to require and validate that all
>>>>>>> contributions and dependencies are licensed under the Apache 2.0
>>>>>>> license or an Apache-compatible license.
>>>>>>>
>>>>>>> == Required Resources ==
>>>>>>>
>>>>>>> === Mailing Lists ===
>>>>>>>
>>>>>>> We currently use a mix of mailing lists. We will migrate our existing
>>>>>>> mailing lists to the following:
>>>>>>>
>>>>>>> * dev@dataflow.incubator.apache.org
>>>>>>>
>>>>>>> * user@dataflow.incubator.apache.org
>>>>>>>
>>>>>>> * private@dataflow.incubator.apache.org
>>>>>>>
>>>>>>> * commits@dataflow.incubator.apache.org
>>>>>>>
>>>>>>> === Source Control ===
>>>>>>>
>>>>>>> The Dataflow team currently uses Git and would like to continue to do
>>>>>>> so. We request a Git repository for Dataflow with mirroring to GitHub
>>>>>>> enabled.
>>>>>>>
>>>>>>>
>>>>>>> === Issue Tracking ===
>>>>>>>
>>>>>>> We request the creation of an Apache-hosted JIRA. The Dataflow project
>>>>>>> is currently using both a public GitHub issue tracker and internal
>>>>>>> Google issue tracking. We will migrate and combine issues from these
>>>>>>> two sources into the Apache JIRA.
>>>>>>>
>>>>>>> == Initial Committers ==
>>>>>>>
>>>>>>> * Aljoscha Krettek     [aljoscha@apache.org]
>>>>>>>
>>>>>>> * Amit Sela            [amitsela33@gmail.com]
>>>>>>>
>>>>>>> * Ben Chambers         [bchambers@google.com]
>>>>>>>
>>>>>>> * Craig Chambers       [chambers@google.com]
>>>>>>>
>>>>>>> * Dan Halperin         [dhalperi@google.com]
>>>>>>>
>>>>>>> * Davor Bonaci         [davor@google.com]
>>>>>>>
>>>>>>> * Frances Perry        [fjp@google.com]
>>>>>>>
>>>>>>> * James Malone         [jamesmalone@google.com]
>>>>>>>
>>>>>>> * Jean-Baptiste Onofré [jbonofre@apache.org]
>>>>>>>
>>>>>>> * Josh Wills           [jwills@apache.org]
>>>>>>>
>>>>>>> * Kostas Tzoumas       [kostas@data-artisans.com]
>>>>>>>
>>>>>>> * Kenneth Knowles      [klk@google.com]
>>>>>>>
>>>>>>> * Luke Cwik            [lcwik@google.com]
>>>>>>>
>>>>>>> * Maximilian Michels   [mxm@apache.org]
>>>>>>>
>>>>>>> * Stephan Ewen         [stephan@data-artisans.com]
>>>>>>>
>>>>>>> * Tom White            [tom@cloudera.com]
>>>>>>>
>>>>>>> * Tyler Akidau         [takidau@google.com]
>>>>>>>
>>>>>>> == Affiliations ==
>>>>>>>
>>>>>>> The initial committers are from six organizations. Google developed
>>>>>>> Dataflow and the Dataflow SDK, data Artisans developed the Flink
>>>>>>> runner, and Cloudera (Labs) developed the Spark runner.
>>>>>>>
>>>>>>> * Cloudera
>>>>>>>
>>>>>>> ** Tom White
>>>>>>>
>>>>>>> * Data Artisans
>>>>>>>
>>>>>>> ** Aljoscha Krettek
>>>>>>>
>>>>>>> ** Kostas Tzoumas
>>>>>>>
>>>>>>> ** Maximilian Michels
>>>>>>>
>>>>>>> ** Stephan Ewen
>>>>>>>
>>>>>>> * Google
>>>>>>>
>>>>>>> ** Ben Chambers
>>>>>>>
>>>>>>> ** Dan Halperin
>>>>>>>
>>>>>>> ** Davor Bonaci
>>>>>>>
>>>>>>> ** Frances Perry
>>>>>>>
>>>>>>> ** James Malone
>>>>>>>
>>>>>>> ** Kenneth Knowles
>>>>>>>
>>>>>>> ** Luke Cwik
>>>>>>>
>>>>>>> ** Tyler Akidau
>>>>>>>
>>>>>>> * PayPal
>>>>>>>
>>>>>>> ** Amit Sela
>>>>>>>
>>>>>>> * Slack
>>>>>>>
>>>>>>> ** Josh Wills
>>>>>>>
>>>>>>> * Talend
>>>>>>>
>>>>>>> ** Jean-Baptiste Onofré
>>>>>>>
>>>>>>> == Sponsors ==
>>>>>>>
>>>>>>> === Champion ===
>>>>>>>
>>>>>>> * Jean-Baptiste Onofré      [jbonofre@apache.org]
>>>>>>>
>>>>>>> === Nominated Mentors ===
>>>>>>>
>>>>>>> * Jim Jagielski           [jim@apache.org]
>>>>>>>
>>>>>>> * Venkatesh Seetharam     [venkatesh@apache.org]
>>>>>>>
>>>>>>> * Bertrand Delacretaz     [bdelacretaz@apache.org]
>>>>>>>
>>>>>>> * Ted Dunning             [tdunning@apache.org]
>>>>>>>
>>>>>>> === Sponsoring Entity ===
>>>>>>>
>>>>>>> The Apache Incubator
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sean
>>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>>
>>
>>
>>
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>



-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal


