incubator-general mailing list archives

From Olivier Lamy <ol...@apache.org>
Subject Re: [discuss] Apache Gobblin Incubator Proposal
Date Thu, 16 Feb 2017 00:19:27 GMT
Hi
Thanks for the proposal Jim!
I will add you as a mentor then start the vote
Cheers
Olivier

On 16 February 2017 at 02:35, Jim Jagielski <jim@jagunet.com> wrote:

> If you need/want another mentor, I volunteer
>
> > On Feb 14, 2017, at 3:53 PM, Olivier Lamy <olamy@apache.org> wrote:
> >
> > Hi
> > Well, I don't see any issues, as no one has discussed the proposal.
> > So I will start the official vote tomorrow.
> > Cheers
> > Olivier
> >
> > On 6 February 2017 at 14:08, Olivier Lamy <olamy@apache.org> wrote:
> >
> >> Hello everyone,
> >> I would like to submit to you a proposal to bring Gobblin to the Apache
> >> Software Foundation.
> >> The text of the proposal is included below and available as a draft here
> >> in the Wiki: https://wiki.apache.org/incubator/GobblinProposal
> >>
> >> We will appreciate any feedback and input.
> >>
> >> Olivier on behalf of the Gobblin community
> >>
> >>
> >> = Apache Gobblin Proposal =
> >> == Abstract ==
> >> Gobblin is a distributed data integration framework that simplifies
> common
> >> aspects of big data integration such as data ingestion, replication,
> >> organization and lifecycle management for both streaming and batch data
> >> ecosystems.
> >>
> >> == Proposal ==
> >>
> >> Gobblin is a universal data integration framework. The framework has
> been
> >> used to build a variety of big data applications such as ingestion,
> >> replication, and data retention. The fundamental constructs provided by
> the
> >> Gobblin framework are:
> >>
> >> 1. An expandable set of connectors that allow data to be integrated from
> >> a variety of sources and sinks. The range of connectors already
> available
> >> in Gobblin is quite diverse and is ever expanding. To highlight
> >> just a few examples, connectors exist for databases (e.g., MySQL, Oracle,
> >> Teradata, Couchbase etc.), web based technologies (REST APIs, FTP/SFTP
> >> servers, Filers), scalable storage (HDFS, S3, Ambry etc.), streaming
> data
> >> (Kafka, EventHubs etc.), and a variety of proprietary data sources and
> >> sinks (e.g. Salesforce, Google Analytics, Google Webmaster etc.).
> Similarly,
> >> Gobblin has a rich library of converters that allow for conversion of
> data
> >> from one format to another as data moves across system boundaries (e.g.
> >> AVRO in HDFS to JSON in another system).
> >>
> >>
> >> 2. Gobblin has a well defined and customizable state management layer
> >> that allows writing stateful applications. These are particularly useful
> >> when solving problems like bulk incremental ingest and keeping several
> >> clusters replicated in sync. The ability to record work that has been
> >> completed and what remains in a scalable manner is critical to writing
> such
> >> diverse applications successfully.
> >>
> >>
> >> 3. Gobblin is agnostic to the underlying execution engine. It can be
> >> tailored to run on top of a variety of execution frameworks ranging from
> >> multiple processes on a single node, to open source execution engines
> like
> >> MapReduce, Spark or Samza, natively on top of raw containers like Yarn
> or
> >> Mesos, and the public cloud like Amazon AWS or Microsoft Azure. We are
> >> extending Gobblin to run on top of a self managed cluster when security
> is
> >> vital.  This allows different applications that require different
> degrees
> >> of scalability, latency or security to be customized for their
> specific
> >> needs. For example, highly latency sensitive applications can be
> executed
> >> in a streaming environment while batch based execution might benefit
> >> applications where the priority might be geared towards optimal
> container
> >> utilization.
> >>
> >> 4. Gobblin comes out of the box with several diagnosability features like
> >> Gobblin metrics and error handling. Collectively, these features allow
> >> Gobblin to operate at the scale of petabytes of data. To give just one
> >> example, the ability to quarantine a few bad records from an isolated
> Kafka
> >> topic without stopping the entire flow from continued execution is vital
> >> when the number of Kafka topics ranges in the thousands and the
> collective
> >> data handled is in the petabytes.
> >>
> >> Gobblin thus provides crisply defined software constructs that can be
> used
> >> to build a vast array of data integration applications customizable for
> >> varied user needs. It has become a preferred technology for data
> >> integration use-cases by many organizations worldwide (see a partial
> list
> >> here).
> >>
> >> == Background ==
> >>
> >> Over the last decade, data integration has evolved use case by use case
> in
> >> most companies. For example, at LinkedIn, when Kafka became a
> significant
> >> part of the data ecosystem, a system called Camus was built to ingest
> this
> >> data for analytics processing on Hadoop. Similarly, we had custom
> pipelines
> >> to ingest data from Salesforce, Oracle and myriad other sources. This
> >> pattern became the norm rather than the exception and, at one point,
> LinkedIn
> >> was running at least fifteen different types of ingestion pipelines.
> This
> >> fragmentation has several unfortunate implications. Operational costs
> scale
> >> with the number of pipelines even if the myriad pipelines share a vast
> >> array of common features. Bug fixes and performance optimizations
> cannot be
> >> shared across the pipelines. A common set of practices around debugging
> and
> >> deployment does not emerge. Each pipeline operator will continue to
> invest
> >> in his little silo of the data integration world completely oblivious to
> >> the challenges of his fellow operator sitting five tables down.
> >>
> >> These experiences were the genesis behind the design and implementation
> of
> >> Gobblin. Gobblin thus started out as a universal data ingestion
> framework
> >> focussed on extracting, transforming, and synchronizing large volumes of
> >> data between different data sources and sinks. Not surprisingly, given
> its
> >> origins, the initial design of Gobblin placed great emphasis on
> >> abstractions that can be leveraged repeatedly. These abstractions have
> >> stood the test of time at LinkedIn and we have been able to leverage the
> >> constructs well beyond ingest. Gobblin's architecture has allowed us at
> >> LinkedIn to use it for a variety of applications ranging from
> optimal
> >> format conversion to adhering to compliance policies set by European
> >> standards. Finally, as noted earlier, Gobblin can be deployed in a
> variety
> >> of execution environments: it can be deployed as a library embedded in
> >> another application or can be used to execute jobs on a public cloud. A
> >> fluid architectural and execution design story has allowed Gobblin to
> >> become a truly successful data integration platform.
> >>
> >> Gobblin has continued to evolve with a variety of utility packages like
> >> Gobblin metrics and Gobblin config management. Collectively, these allow
> >> organizations utilizing Gobblin to use a system that has been battle
> tested
> >> at LinkedIn scale. This is something that its consumers have come to
> >> appreciate greatly.
> >>
> >> == Rationale ==
> >>
> >> Gobblin's entry to the Apache foundation is beneficial to both the
> Gobblin
> >> and the Apache communities. Gobblin has greatly benefited from its open
> >> source roots. Its community and adoption have grown greatly as a result.
> >> More importantly, the feedback from the community whether through
> >> interactions at meetups or through the mailing list has allowed for a
> rich
> >> exchange of ideas. In order to grow the Gobblin community and improve
> >> the project, we would like to propose Gobblin to the Apache incubator.
> The
> >> Gobblin community will greatly benefit from the established development
> and
> >> consensus processes that have worked well for other projects. The Apache
> >> process has served many other open source projects well and we believe
> that
> >> the Gobblin community will greatly benefit from these practices as well.
> >>
> >> == Initial Goals ==
> >>
> >> Migrate the existing codebase to Apache
> >> Study and Integrate with the Apache development process
> >> Ensure all dependencies are compliant with Apache License version 2.0
> >> Incremental development and releases per Apache guidelines
> >> Improve the relationship between Gobblin and other Apache projects
> >>
> >> == Current Status ==
> >>
> >> Gobblin has undergone five major releases (0.5, 0.6, 0.7, 0.8, 0.9) and
> >> many minor ones. The latest version, Gobblin 0.9, was released
> in
> >> December 2016. Gobblin is being used in production by over 20
> >> organizations. The Gobblin codebase is currently hosted at github.com,
> which
> >> will seed the Apache git repository.
> >>
> >> === Meritocracy ===
> >>
> >> We plan to invest in supporting a meritocracy. We will discuss the
> >> requirements in an open forum. Several companies have already expressed
> >> interest in this project, and we intend to invite additional developers
> to
> >> participate. We will encourage and monitor community participation so
> that
> >> privileges can be extended to those that contribute.
> >>
> >> === Community ===
> >>
> >> The need for an extensible and flexible data integration platform in the
> >> open source world is tremendous. Gobblin is currently being used by at least
> 20
> >> organizations worldwide (some examples are listed here). By bringing
> >> Gobblin into Apache, we believe that the community will grow even
> bigger.
> >>
> >> === Core Developers ===
> >>
> >> Gobblin was started by engineers at LinkedIn, and now has developers
> from
> >> Google, Facebook, LinkedIn, Cloudera, Nerdwallet, Swisscom, and many
> other
> >> companies.
> >>
> >> === Alignment ===
> >>
> >> Gobblin aligns exceedingly well with the Apache ecosystem. Gobblin is
> >> built leveraging several existing Apache projects (Apache Helix, Yarn,
> >> Zookeeper etc.). As Gobblin matures, we expect to leverage several other
> >> Apache projects further. This leverage invariably results in
> contributions
> >> back to these projects (e.g., a contribution to Helix was made during
> the
> >> Gobblin Yarn development). Finally, being an integration platform, it
> >> serves as a bridge between several Apache projects like Apache Hadoop
> and
> >> Apache Kafka. This integration is highly desired and their interaction
> >> through Gobblin will lead to a virtuous cycle of greater adoption and
> newer
> >> features in these projects. Thus, we believe that it will be a nice
> >> addition to the current set of big data projects under the auspices of
> the
> >> Apache foundation.
> >>
> >> == Known Risks ==
> >>
> >> === Orphaned Products ===
> >>
> >> The risk of the Gobblin project being abandoned is minimal. As noted
> >> earlier, there are many organizations that have already invested in
> Gobblin
> >> significantly and are thus incentivized to continue development. Many of
> >> these organizations operate critical data ingest, compliance and
> retention
> >> pipelines built with Gobblin and are thus heavily invested in the
> continued
> >> success of Gobblin.
> >>
> >> === Inexperience with Open Source ===
> >>
> >> Gobblin has existed as a healthy open source project for several years.
> >> During that time, we have curated an open-source community successfully.
> >> Any risks that we foresee are ones associated with scaling our open
> source
> >> communication and operation process rather than with inherent
> inexperience
> >> in operating an open source project.
> >>
> >> === Homogenous Developers ===
> >>
> >> Gobblin’s committers are employed by companies of varying sizes and
> >> industry. Committers come from well heeled internet companies like
> Google,
> >> LinkedIn and Facebook. We also have developers from traditional
> enterprise
> >> companies like Swisscom. Well funded startups like Nerdwallet are
> active in
> >> the community of developers. We plan to double our efforts in
> cultivating
> >> a diverse set of committers for Gobblin.
> >>
> >> === Reliance on Salaried Developers ===
> >>
> >> It is expected that Gobblin development will occur on both salaried time
> >> and on volunteer time, after hours. The majority of initial committers
> are
> >> paid by their employer to contribute to this project. However, they are
> all
> >> passionate about the project, and we are confident that the project will
> >> continue even if no salaried developers contribute to the project. We
> are
> >> committed to recruiting additional committers including non-salaried
> >> developers.
> >>
> >> === Relationships with Other Apache Products ===
> >>
> >> As noted earlier, Gobblin leverages several open source projects and
> >> contributes back to them. There is also overlap with aspects of other
> >> Apache projects that we will discuss briefly here. Apache Nifi, like
> >> Gobblin, aspires to reduce the operational overhead arising from data
> >> heterogeneity. Apache Nifi is structured as a visual flow based approach
> >> and provides built-in constructs for buffering data, prioritizing data,
> and
> >> understanding data lineage as data flows across systems. Apache Nifi has
> >> its own dataflow based execution engine with buffering, scheduling and
> >> streaming capabilities. Apache Falcon is a Hadoop centric data
> governance
> >> engine for defining, scheduling, and monitoring data management policies
> >> through flow definition typically for data that has been ingested into
> >> Hadoop already. Apache Falcon generally delegates data management jobs
> to
> >> tools that already exist in the Hadoop ecosystem (e.g. Distcp, Sqoop,
> Hive
> >> etc). Apache Sqoop is primarily geared for bulk ingest especially from
> >> databases which is one part of Gobblin’s feature set. Apache Flume
> focuses
> >> primarily on streaming data movement. Finally, general purpose data
> >> processing engines like Apache Flink, Apache Samza, and Apache Spark
> focus
> >> on generic computation.
> >>
> >> Gobblin's design choices intersect with specific features in all of these
> >> systems; however, in aggregate, it is a different point in the design
> space.
> >> It is designed to handle both streaming and batch data. It supports
> >> execution through a standalone cluster mode as well as through existing
> >> frameworks such as MR, Yarn, Hive, Samza etc., allowing users to choose
> the
> >> deployment model that is optimal for the specific data integration
> >> challenge. It provides native optimized implementations for critical
> >> integrations such as Kafka, Hadoop-to-Hadoop copies etc. Gobblin also
> >> supports both Hadoop and non-Hadoop data, being able to ingest data into
> >> Kafka as well as other key-value stores like Couchbase. Gobblin is also
> not
> >> just a generic computation framework; it has specific constructs for
> data
> >> integration patterns such as data quality metrics and policies.
> Gobblin’s
> >> configuration management system allows it to be fully multi-tenant and
> take
> >> advantage of grouped policies when required. For batch workloads,
> Gobblin
> >> has a planning phase that provides for better resource utilization.
> >>
> >> In summary, there is healthy diversity in the number of systems
> >> approaching the interesting and pressing problem of big data
> integration.
> >> We believe that Gobblin will provide another compelling choice in that
> >> design space.
> >>
> >> === An Excessive Fascination with the Apache Brand ===
> >>
> >> Gobblin is already a healthy and well known open source project. This
> >> proposal is not for the purpose of generating publicity. Rather, the
> >> primary benefits to joining Apache are already outlined in the Rationale
> >> section.
> >>
> >> == Documentation ==
> >>
> >> The reader will find these websites highly relevant:
> >> * Website: http://linkedin.github.io/gobblin/
> >> * Documentation: https://gobblin.readthedocs.io/en/latest/
> >> * Codebase: https://github.com/linkedin/gobblin/
> >> * User group: https://groups.google.com/forum/#!forum/gobblin-users
> >>
> >> == Source and Intellectual Property Submission Plan ==
> >>
> >> The Gobblin codebase is currently hosted on GitHub. This is the exact
> >> codebase that we would migrate to the Apache foundation. The Gobblin
> source
> >> code is already licensed under Apache License Version 2.0. Going
> forward,
> >> we will continue to have all the contributions licensed directly to the
> >> Apache foundation through our signed Individual Contributor License
> >> Agreements for all the committers on the project.
> >>
> >> == External Dependencies ==
> >>
> >> To the best of our knowledge, all of Gobblin's dependencies are
> distributed
> >> under Apache compatible licenses. Upon acceptance to the incubator, we
> >> would begin a thorough analysis of all transitive dependencies to verify
> >> this fact and introduce license checking into the build and release
> process
> >> (for instance integrating Apache Rat).
> >>
> >> == Cryptography ==
> >>
> >> We do not expect Gobblin to be a controlled export item due to the use
> of
> >> encryption.
> >>
> >> == Required Resources ==
> >>
> >> === Mailing lists ===
> >>
> >> * gobblin-user
> >> * gobblin-dev
> >> * gobblin-commits
> >> * gobblin-private for private PMC discussions (with moderated
> >> subscriptions)
> >>
> >> === Subversion Directory ===
> >>
> >> Git is the preferred source control system: git://git.apache.org/gobblin
> >>
> >> === Issue Tracking ===
> >>
> >> JIRA Gobblin (GOBBLIN)
> >>
> >> === Other Resources ===
> >>
> >> The existing code already has unit and integration tests, so we would
> >> like a Jenkins instance to run them whenever a new patch is submitted.
> This
> >> can be added after project creation.
> >>
> >> == Initial Committers ==
> >>
> >> * Abhishek Tiwari <abhishektiwari dot btech at gmail dot com>
> >> * Shirshanka Das <shirshanka at apache dot org>
> >> * Chavdar Botev <cbotev at gmail dot com>
> >> * Sahil Takiar <takiar.sahil at gmail dot com>
> >> * Yinan Li <liyinan926 at gmail dot com>
> >> * Ziyang Liu <>
> >> * Lorand Bendig <lbendig at gmail dot com>
> >> * Issac Buenrostro <ibuenros at linkedin dot com>
> >> * Hung Tran <hutran at linkedin dot com>
> >> * Olivier Lamy <olamy at apache dot org>
> >> * Jean-Baptiste Onofré <jbonofre@apache.org>
> >>
> >> == Affiliations ==
> >>
> >> * Abhishek Tiwari - LinkedIn
> >> * Shirshanka Das - LinkedIn
> >> * Chavdar Botev - Stealth Startup
> >> * Sahil Takiar - Cloudera
> >> * Yinan Li - Google
> >> * Ziyang Liu - Facebook
> >> * Lorand Bendig - Swisscom
> >> * Issac Buenrostro - LinkedIn
> >> * Hung Tran - LinkedIn
> >> * Olivier Lamy - Webtide
> >> * Jean-Baptiste Onofre - Talend
> >>
> >> == Sponsors ==
> >>
> >> === Champion ===
> >>
> >> Olivier Lamy <olamy at apache dot org>
> >>
> >> === Nominated Mentors ===
> >>
> >> * Olivier Lamy <olamy at apache dot org>
> >> * Jean-Baptiste Onofre <jbonofre at apache dot org>
> >> * ?
> >> * ?
> >>
> >> == Sponsoring Entity ==
> >> The Apache Incubator
> >>
> >
> >
> >
> > --
> > Olivier Lamy
> > http://twitter.com/olamy | http://linkedin.com/in/olamy


-- 
Olivier Lamy
http://twitter.com/olamy | http://linkedin.com/in/olamy
