incubator-general mailing list archives

From Jean-Baptiste Onofré <...@nanthrax.net>
Subject Re: [DISCUSS] Apache Dataflow Incubator Proposal
Date Mon, 25 Jan 2016 06:38:56 GMT
Hey Ajay,

Great: I added you to the proposal.

Thanks!
Regards
JB

On 01/25/2016 06:25 AM, Ajay Yadav wrote:
> Great proposal. I would also like to contribute to the project, especially
> the Python SDK, if possible.
>
> Cheers
> Ajay Yadava
>
> On Sun, Jan 24, 2016 at 1:25 AM, Jean-Baptiste Onofré <jb@nanthrax.net>
> wrote:
>
>> Hi Seshu,
>>
>> It does both: streaming and batch data processing.
>>
>> Regards
>> JB
>>
>> On 01/23/2016 03:01 PM, Adunuthula, Seshu wrote:
>>
>>> I did not get a chance to play with it yet. Within Google, is it used more
>>> as a MR replacement or a stream processing engine? Or does it do both of
>>> them fantastically well?
>>>
>>>
>>> On 1/22/16, 10:58 AM, "Frances Perry" <fjp@google.com.INVALID> wrote:
>>>
>>>> Crunch started as a clone of FlumeJava, which was Google internal. In the
>>>> meantime inside Google, FlumeJava evolved into Dataflow. So all three
>>>> share a number of concepts like PCollections, ParDo, DoFn, etc. However,
>>>> Dataflow adds a number of new things -- the biggest being unified
>>>> batch/streaming semantics using concepts like Windowing and Triggers.
>>>> Tyler Akidau's O'Reilly post has a really nice explanation:
>>>> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>>>>
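[A minimal, self-contained sketch of the fixed-windowing idea described above. Illustrative only: this is not the Dataflow SDK API, the function name is invented, and real Dataflow windowing also handles out-of-order and late data via watermarks and triggers.]

```python
from collections import defaultdict

def fixed_windows(events, window_size):
    # Assign (timestamp, value) events to fixed, non-overlapping windows,
    # keyed by each window's start time. Late/out-of-order data handling
    # (watermarks, triggers) is deliberately ignored in this sketch.
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_size) * window_size
        windows[window_start].append(value)
    return dict(windows)

events = [(1, "a"), (4, "b"), (61, "c"), (65, "d"), (130, "e")]
print(fixed_windows(events, 60))
# {0: ['a', 'b'], 60: ['c', 'd'], 120: ['e']}
```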
>>>> On Fri, Jan 22, 2016 at 10:42 AM, Ashish <paliwalashish@gmail.com>
>>>> wrote:
>>>>
>>>>> Crunch has Spark pipelines, but I am not sure about the runner
>>>>> abstraction.
>>>>>
>>>>> Maybe Josh Wills or Tom White can provide more insight on this topic.
>>>>> They are core devs for both projects :)
>>>>>
>>>>> On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré <jb@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I don't know Crunch deeply, but AFAIK, Crunch creates MapReduce
>>>>>> pipelines; it doesn't provide a runner abstraction. It's based on
>>>>>> FlumeJava.
>>>>>>
>>>>>> The logic is very similar (with DoFns, pipelines, ...). Correct me if
>>>>>> I'm wrong, but Crunch started after Google Dataflow, especially because
>>>>>> Dataflow was not open-sourced at that time.
>>>>>>
>>>>>> So, I agree it's very similar/close.
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 01/22/2016 05:51 PM, Ashish wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi JB,
>>>>>>>
>>>>>>> Curious to know how it compares to Apache Crunch? The constructs look
>>>>>>> very familiar (I had used Crunch long ago).
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>> - Ashish
>>>>>>>
>>>>>>> On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré <jb@nanthrax.net>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Hi Seshu,
>>>>>>>>
>>>>>>>> I blogged about the Apache Dataflow proposal:
>>>>>>>> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
>>>>>>>>
>>>>>>>> You can see in the "what's next?" section that new runners, skins, and
>>>>>>>> sources are on our roadmap. Definitely, a Storm runner could be part
>>>>>>>> of this.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>>
>>>>>>>>
>>>>>>>> On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:
>>>>>>>>>
>>>>>>>>> Awesome to see Cloud Dataflow coming to Apache. The stream processing
>>>>>>>>> area has in general been fragmented with a variety of solutions;
>>>>>>>>> hoping the community galvanizes around Apache Dataflow.
>>>>>>>>>
>>>>>>>>> We are still in the "Apache Storm" world. Any chance of folks
>>>>>>>>> building a "Storm Runner"?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 1/20/16, 9:39 AM, "James Malone" <jamesmalone@google.com.INVALID>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>> Great proposal. I like that your proposal includes a well
>>>>>>>>>>> presented roadmap, but I don't see any goals that directly address
>>>>>>>>>>> building a larger community. Y'all have any ideas around outreach
>>>>>>>>>>> that will help with adoption?
>>>>>>>>>>
>>>>>>>>>> Thank you and fair point. We have a few additional ideas which we
>>>>>>>>>> can put into the Community section.
>>>>>>>>>>
>>>>>>>>>>> As a start, I recommend y'all add a section to the proposal on the
>>>>>>>>>>> wiki page for "Additional Interested Contributors" so that folks
>>>>>>>>>>> who want to sign up to participate in the project can do so
>>>>>>>>>>> without requesting additions to the initial committer list.
>>>>>>>>>>
>>>>>>>>>> This is a great idea and I think it makes a lot of sense to add an
>>>>>>>>>> "Additional Interested Contributors" section to the proposal.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
>>>>>>>>>> jamesmalone@google.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>
>>>>>>>>>>> Attached to this message is a proposed new project - Apache
>>>>>>>>>>> Dataflow, a unified programming model for data processing and
>>>>>>>>>>> integration.
>>>>>>>>>>>
>>>>>>>>>>> The text of the proposal is included below. Additionally, the
>>>>>>>>>>> proposal is in draft form on the wiki where we will make any
>>>>>>>>>>> required changes:
>>>>>>>>>>> https://wiki.apache.org/incubator/DataflowProposal
>>>>>>>>>>>
>>>>>>>>>>> We look forward to your feedback and input.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> James
>>>>>>>>>>>
>>>>>>>>>>> ----
>>>>>>>>>>>
>>>>>>>>>>> = Apache Dataflow =
>>>>>>>>>>>
>>>>>>>>>>> == Abstract ==
>>>>>>>>>>>
>>>>>>>>>>> Dataflow is an open source, unified model and set of
>>>>>>>>>>> language-specific SDKs for defining and executing data processing
>>>>>>>>>>> workflows, and also data ingestion and integration flows,
>>>>>>>>>>> supporting Enterprise Integration Patterns (EIPs) and Domain
>>>>>>>>>>> Specific Languages (DSLs). Dataflow pipelines simplify the
>>>>>>>>>>> mechanics of large-scale batch and streaming data processing and
>>>>>>>>>>> can run on a number of runtimes like Apache Flink, Apache Spark,
>>>>>>>>>>> and Google Cloud Dataflow (a cloud service). Dataflow also brings
>>>>>>>>>>> DSLs in different languages, allowing users to easily implement
>>>>>>>>>>> their data integration processes.
>>>>>>>>>>>
>>>>>>>>>>> == Proposal ==
>>>>>>>>>>>
>>>>>>>>>>> Dataflow is a simple, flexible, and powerful system for
>>>>>>>>>>> distributed data processing at any scale. Dataflow provides a
>>>>>>>>>>> unified programming model, a software development kit to define
>>>>>>>>>>> and construct data processing pipelines, and runners to execute
>>>>>>>>>>> Dataflow pipelines in several runtime engines, like Apache Spark,
>>>>>>>>>>> Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a
>>>>>>>>>>> variety of streaming or batch data processing goals including ETL,
>>>>>>>>>>> stream analysis, and aggregate computation. The underlying
>>>>>>>>>>> programming model for Dataflow provides MapReduce-like
>>>>>>>>>>> parallelism, combined with support for powerful data windowing and
>>>>>>>>>>> fine-grained correctness control.
>>>>>>>>>>>
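[The "MapReduce-like parallelism" mentioned above can be sketched in a few lines. This is a toy illustration, not Dataflow code; the function and parameter names are invented.]

```python
from itertools import groupby

def map_reduce(records, mapper, reducer):
    # MapReduce-shaped computation: the mapper runs independently per record
    # (hence parallelizable); values are then grouped by key and reduced.
    mapped = [kv for record in records for kv in mapper(record)]
    mapped.sort(key=lambda kv: kv[0])
    return {key: reducer([v for _, v in group])
            for key, group in groupby(mapped, key=lambda kv: kv[0])}

counts = map_reduce(
    ["a b a", "b c"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=sum)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```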
>>>>>>>>>>> == Background ==
>>>>>>>>>>>
>>>>>>>>>>> Dataflow started as a set of Google projects focused on making
>>>>>>>>>>> data processing easier, faster, and less costly. The Dataflow
>>>>>>>>>>> model is a successor to MapReduce, FlumeJava, and MillWheel inside
>>>>>>>>>>> Google and is focused on providing a unified solution for batch
>>>>>>>>>>> and stream processing. These projects on which Dataflow is based
>>>>>>>>>>> have been published in several papers made available to the
>>>>>>>>>>> public:
>>>>>>>>>>>
>>>>>>>>>>> * MapReduce - http://research.google.com/archive/mapreduce.html
>>>>>>>>>>> * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>>>>>>>>>>> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>>>>>>>>>>> * MillWheel - http://research.google.com/pubs/pub41378.html
>>>>>>>>>>>
>>>>>>>>>>> Dataflow was designed from the start to provide a portable
>>>>>>>>>>> programming layer. When you define a data processing pipeline with
>>>>>>>>>>> the Dataflow model, you are creating a job which is capable of
>>>>>>>>>>> being processed by any number of Dataflow processing engines.
>>>>>>>>>>> Several engines have been developed to run Dataflow pipelines in
>>>>>>>>>>> other open source runtimes, including a Dataflow runner for Apache
>>>>>>>>>>> Flink and Apache Spark. There is also a "direct runner", for
>>>>>>>>>>> execution on the developer machine (mainly for dev/debug
>>>>>>>>>>> purposes). Another runner allows a Dataflow program to run on a
>>>>>>>>>>> managed service, Google Cloud Dataflow, in Google Cloud Platform.
>>>>>>>>>>> The Dataflow Java SDK is already available on GitHub, and
>>>>>>>>>>> independent from the Google Cloud Dataflow service. A Python SDK
>>>>>>>>>>> is currently in active development.
>>>>>>>>>>>
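[To illustrate the runner idea from the paragraph above: the pipeline is a declarative description, and a runner decides how to execute it. A toy sketch, not the actual SDK; the DirectRunner name here is a stand-in.]

```python
class DirectRunner:
    # Executes a pipeline in-process on the developer machine; a Spark or
    # Flink runner would instead translate the same pipeline into its own
    # distributed jobs.
    def run(self, pipeline, data):
        for transform in pipeline:
            data = [transform(x) for x in data]
        return data

# The pipeline is just a list of per-element steps, so its definition is
# independent of the engine that eventually executes it.
pipeline = [str.strip, str.lower]
print(DirectRunner().run(pipeline, ["  Hello ", " WORLD "]))  # ['hello', 'world']
```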
>>>>>>>>>>> In this proposal, the Dataflow SDKs, model, and a set of runners
>>>>>>>>>>> will be submitted as an OSS project under the ASF. The runners
>>>>>>>>>>> which are a part of this proposal include those for Spark (from
>>>>>>>>>>> Cloudera), Flink (from data Artisans), and local development (from
>>>>>>>>>>> Google); the Google Cloud Dataflow service runner is not included
>>>>>>>>>>> in this proposal. Further references to Dataflow will refer to the
>>>>>>>>>>> Dataflow model, SDKs, and runners which are a part of this
>>>>>>>>>>> proposal (Apache Dataflow) only. The initial submission will
>>>>>>>>>>> contain the already-released Java SDK; Google intends to submit
>>>>>>>>>>> the Python SDK later in the incubation process. The Google Cloud
>>>>>>>>>>> Dataflow service will continue to be one of many runners for
>>>>>>>>>>> Dataflow, built on Google Cloud Platform, to run Dataflow
>>>>>>>>>>> pipelines. Necessarily, Cloud Dataflow will develop against the
>>>>>>>>>>> Apache project additions, updates, and changes. Google Cloud
>>>>>>>>>>> Dataflow will become one user of Apache Dataflow and will
>>>>>>>>>>> participate in the project openly and publicly.
>>>>>>>>>>>
>>>>>>>>>>> The Dataflow programming model has been designed with simplicity,
>>>>>>>>>>> scalability, and speed as key tenets. In the Dataflow model, you
>>>>>>>>>>> only need to think about four top-level concepts when constructing
>>>>>>>>>>> your data processing job:
>>>>>>>>>>>
>>>>>>>>>>> * Pipelines - The data processing job, made of a series of
>>>>>>>>>>> computations including input, processing, and output
>>>>>>>>>>> * PCollections - Bounded (or unbounded) datasets which represent
>>>>>>>>>>> the input, intermediate, and output data in pipelines
>>>>>>>>>>> * PTransforms - A data processing step in a pipeline in which one
>>>>>>>>>>> or more PCollections are an input and output
>>>>>>>>>>> * I/O Sources and Sinks - APIs for reading and writing data which
>>>>>>>>>>> are the roots and endpoints of the pipeline
>>>>>>>>>>>
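[The four concepts above can be sketched as a toy pipeline. Illustrative only: these classes mimic the shape of the model, not the actual Dataflow SDK API.]

```python
class PCollection:
    # A bounded dataset flowing through the pipeline (real PCollections
    # may also be unbounded streams).
    def __init__(self, elements):
        self.elements = list(elements)

class Pipeline:
    def __init__(self, source):
        # I/O source: the root of the pipeline.
        self.pcoll = PCollection(source)

    def apply(self, transform):
        # PTransform: takes a PCollection as input, produces one as output.
        self.pcoll = PCollection(transform(self.pcoll.elements))
        return self

    def write(self, sink):
        # I/O sink: the endpoint of the pipeline.
        sink.extend(self.pcoll.elements)

output = []
Pipeline(["apache", "dataflow"]) \
    .apply(lambda xs: [x.upper() for x in xs]) \
    .write(output)
print(output)  # ['APACHE', 'DATAFLOW']
```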
>>>>>>>>>>> == Rationale ==
>>>>>>>>>>>
>>>>>>>>>>> With Dataflow, Google intended to develop a framework which
>>>>>>>>>>> allowed developers to be maximally productive in defining the
>>>>>>>>>>> processing, and then be able to execute the program at various
>>>>>>>>>>> levels of latency/cost/completeness without re-architecting or
>>>>>>>>>>> re-writing it. This goal was informed by Google's past experience
>>>>>>>>>>> developing several models, frameworks, and tools useful for
>>>>>>>>>>> large-scale and distributed data processing. While Google has
>>>>>>>>>>> previously published papers describing some of its technologies,
>>>>>>>>>>> Google decided to take a different approach with Dataflow. Google
>>>>>>>>>>> open-sourced the SDK and model alongside commercialization of the
>>>>>>>>>>> idea and ahead of publishing papers on the topic. As a result, a
>>>>>>>>>>> number of open source runtimes exist for Dataflow, such as the
>>>>>>>>>>> Apache Flink and Apache Spark runners.
>>>>>>>>>>>
>>>>>>>>>>> We believe that submitting Dataflow as an Apache project will
>>>>>>>>>>> provide an immediate, worthwhile, and substantial contribution to
>>>>>>>>>>> the open source community. As an incubating project, we believe
>>>>>>>>>>> Dataflow will have a better opportunity to provide a meaningful
>>>>>>>>>>> contribution to OSS and also integrate with other Apache projects.
>>>>>>>>>>>
>>>>>>>>>>> In the long term, we believe Dataflow can be a powerful
>>>>>>>>>>> abstraction layer for data processing. By providing an abstraction
>>>>>>>>>>> layer for data pipelines and processing, data workflows can be
>>>>>>>>>>> increasingly portable, resilient to breaking changes in tooling,
>>>>>>>>>>> and compatible across many execution engines, runtimes, and open
>>>>>>>>>>> source projects.
>>>>>>>>>>>
>>>>>>>>>>> == Initial Goals ==
>>>>>>>>>>>
>>>>>>>>>>> We are breaking our initial goals into immediate (< 2 months),
>>>>>>>>>>> short-term (2-4 months), and intermediate-term (> 4 months).
>>>>>>>>>>>
>>>>>>>>>>> Our immediate goals include the following:
>>>>>>>>>>>
>>>>>>>>>>> * Plan for reconciling the Dataflow Java SDK and various runners
>>>>>>>>>>> into one project
>>>>>>>>>>> * Plan for refactoring the existing Java SDK for better
>>>>>>>>>>> extensibility by SDK and runner writers
>>>>>>>>>>> * Validating all dependencies are ASL 2.0 or compatible
>>>>>>>>>>> * Understanding and adapting to the Apache development process
>>>>>>>>>>>
>>>>>>>>>>> Our short-term goals include:
>>>>>>>>>>>
>>>>>>>>>>> * Moving the newly-merged lists and build utilities to Apache
>>>>>>>>>>> * Start refactoring the codebase and move code to the Apache Git
>>>>>>>>>>> repo
>>>>>>>>>>> * Continue development of new features, functions, and fixes in
>>>>>>>>>>> the Dataflow Java SDK and Dataflow runners
>>>>>>>>>>> * Cleaning up the Dataflow SDK sources and crafting a roadmap and
>>>>>>>>>>> plan for how to include new major ideas, modules, and runtimes
>>>>>>>>>>> * Establishment of an easy and clear build/test framework for
>>>>>>>>>>> Dataflow and associated runtimes; creation of testing, rollback,
>>>>>>>>>>> and validation policy
>>>>>>>>>>> * Analysis and design for work needed to make Dataflow a better
>>>>>>>>>>> data processing abstraction layer for multiple open source
>>>>>>>>>>> frameworks and environments
>>>>>>>>>>>
>>>>>>>>>>> Finally, we have a number of intermediate-term goals:
>>>>>>>>>>>
>>>>>>>>>>> * Roadmapping, planning, and execution of integrations with other
>>>>>>>>>>> OSS and non-OSS projects/products
>>>>>>>>>>> * Inclusion of an additional SDK for Python, which is under active
>>>>>>>>>>> development
>>>>>>>>>>>
>>>>>>>>>>> == Current Status ==
>>>>>>>>>>>
>>>>>>>>>>> === Meritocracy ===
>>>>>>>>>>>
>>>>>>>>>>> Dataflow was initially developed based on ideas from many
>>>>>>>>>>> employees within Google. As an ASL OSS project on GitHub, the
>>>>>>>>>>> Dataflow SDK has received contributions from data Artisans,
>>>>>>>>>>> Cloudera Labs, and other individual developers. As a project under
>>>>>>>>>>> incubation, we are committed to expanding our effort to build an
>>>>>>>>>>> environment which supports a meritocracy. We are focused on
>>>>>>>>>>> engaging the community and other related projects for support and
>>>>>>>>>>> contributions. Moreover, we are committed to ensuring contributors
>>>>>>>>>>> and committers to Dataflow come from a broad mix of organizations
>>>>>>>>>>> through a merit-based decision process during incubation. We
>>>>>>>>>>> believe strongly in the Dataflow model and are committed to
>>>>>>>>>>> growing an inclusive community of Dataflow contributors.
>>>>>>>>>>>
>>>>>>>>>>> === Community ===
>>>>>>>>>>>
>>>>>>>>>>> The core of the Dataflow Java SDK has been developed by Google for
>>>>>>>>>>> use with Google Cloud Dataflow. Google has active community
>>>>>>>>>>> engagement in the SDK GitHub repository
>>>>>>>>>>> (https://github.com/GoogleCloudPlatform/DataflowJavaSDK) and on
>>>>>>>>>>> Stack Overflow
>>>>>>>>>>> (http://stackoverflow.com/questions/tagged/google-cloud-dataflow),
>>>>>>>>>>> and has had contributions from a number of organizations and
>>>>>>>>>>> individuals. Every day, Cloud Dataflow is actively used by a
>>>>>>>>>>> number of organizations and institutions for batch and stream
>>>>>>>>>>> processing of data. We believe acceptance will allow us to
>>>>>>>>>>> consolidate existing Dataflow-related work, grow the Dataflow
>>>>>>>>>>> community, and deepen connections between Dataflow and other open
>>>>>>>>>>> source projects.
>>>>>>>>>>>
>>>>>>>>>>> === Core Developers ===
>>>>>>>>>>>
>>>>>>>>>>> The core developers for Dataflow and the Dataflow runners are:
>>>>>>>>>>>
>>>>>>>>>>> * Frances Perry
>>>>>>>>>>> * Tyler Akidau
>>>>>>>>>>> * Davor Bonaci
>>>>>>>>>>> * Luke Cwik
>>>>>>>>>>> * Ben Chambers
>>>>>>>>>>> * Kenn Knowles
>>>>>>>>>>> * Dan Halperin
>>>>>>>>>>> * Daniel Mills
>>>>>>>>>>> * Mark Shields
>>>>>>>>>>> * Craig Chambers
>>>>>>>>>>> * Maximilian Michels
>>>>>>>>>>> * Tom White
>>>>>>>>>>> * Josh Wills
>>>>>>>>>>>
>>>>>>>>>>> === Alignment ===
>>>>>>>>>>>
>>>>>>>>>>> The Dataflow SDK can be used to create Dataflow pipelines which
>>>>>>>>>>> can be executed on Apache Spark or Apache Flink. Dataflow is also
>>>>>>>>>>> related to other Apache projects, such as Apache Crunch. We plan
>>>>>>>>>>> on expanding functionality for Dataflow runners, support for
>>>>>>>>>>> additional domain specific languages, and increased portability so
>>>>>>>>>>> Dataflow is a powerful abstraction layer for data processing.
>>>>>>>>>>>
>>>>>>>>>>> == Known Risks ==
>>>>>>>>>>>
>>>>>>>>>>> === Orphaned Products ===
>>>>>>>>>>>
>>>>>>>>>>> The Dataflow SDK is presently used by several organizations, from
>>>>>>>>>>> small startups to Fortune 100 companies, to construct production
>>>>>>>>>>> pipelines which are executed in Google Cloud Dataflow. Google has
>>>>>>>>>>> a long-term commitment to advance the Dataflow SDK; moreover,
>>>>>>>>>>> Dataflow is seeing increasing interest, development, and adoption
>>>>>>>>>>> from organizations outside of Google.
>>>>>>>>>>>
>>>>>>>>>>> === Inexperience with Open Source ===
>>>>>>>>>>>
>>>>>>>>>>> Google believes strongly in open source and the exchange of
>>>>>>>>>>> information to advance new ideas and work. Examples of this
>>>>>>>>>>> commitment are active OSS projects such as Chromium
>>>>>>>>>>> (https://www.chromium.org) and Kubernetes (http://kubernetes.io/).
>>>>>>>>>>> With Dataflow, we have tried to be increasingly open and
>>>>>>>>>>> forward-looking; we have published a paper in the VLDB conference
>>>>>>>>>>> describing the Dataflow model
>>>>>>>>>>> (http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick
>>>>>>>>>>> to release the Dataflow SDK as open source software with the
>>>>>>>>>>> launch of Cloud Dataflow. Our submission to the Apache Software
>>>>>>>>>>> Foundation is a logical extension of our commitment to open source
>>>>>>>>>>> software.
>>>>>>>>>>>
>>>>>>>>>>> === Homogeneous Developers ===
>>>>>>>>>>>
>>>>>>>>>>> The majority of committers in this proposal belong to Google due
>>>>>>>>>>> to the fact that Dataflow has emerged from several internal Google
>>>>>>>>>>> projects. This proposal also includes committers outside of Google
>>>>>>>>>>> who are actively involved with other Apache projects, such as
>>>>>>>>>>> Hadoop, Flink, and

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

