incubator-cvs mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "HudiProposal" by ThomasWeise
Date Sun, 13 Jan 2019 21:50:38 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "HudiProposal" page has been changed by ThomasWeise:
https://wiki.apache.org/incubator/HudiProposal?action=diff&rev1=6&rev2=7

  Hudi is largely implemented as an Apache Spark library that reads/writes data from/to Hadoop
compatible filesystem. SQL queries on Hudi datasets are supported via specialized Apache Hadoop
input formats, that understand Hudi’s storage layout. Currently, Hudi manages datasets using
a combination of Apache Parquet & Apache Avro file/serialization formats.
  
  
- === Background ===
+ == Background ==
  
  Apache Hadoop distributed filesystem (HDFS) & other compatible cloud storage systems
(e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as longer term analytical storage for
thousands of organizations. Typical analytical datasets are built by reading data from a source
(e.g: upstream databases, messaging buses, or other datasets), transforming the data, writing
results back to storage, & making it available for analytical queries--all of this typically
accomplished in batch jobs which operate in a bulk fashion on partitions of datasets. Such
a style of processing typically incurs large delays in making data available to queries as
well as lot of complexity in carefully partitioning datasets to guarantee latency SLAs.  
  
@@ -22, +22 @@

  
  Hudi was originally developed at Uber (original name “Hoodie”) to address such broad
inefficiencies in ingest & ETL & ML pipelines across Uber’s data ecosystem that
required the upsert & incremental consumption primitives supported by Hudi.
  
- === Rationale ===
+ == Rationale ==
  
  We truly believe the capabilities supported by Hudi would be increasingly useful for big-data
ecosystems, as data volumes & need for faster data continue to increase. A detailed description
of target use-cases can be found at https://uber.github.io/hudi/use_cases.html. 
  
  Given our reliance on so many great Apache projects, we believe that the Apache way of open
source community driven development will enable us to evolve Hudi in collaboration with a
diverse set of contributors who can bring new ideas into the project.
  
- === Initial Goals ===
+ == Initial Goals ==
  
   * Move the existing codebase, website, documentation, and mailing lists to an Apache-hosted
infrastructure.
   * Integrate with the Apache development process.
   * Ensure all dependencies are compliant with Apache License version 2.0.
   * Incrementally develop and release per Apache guidelines.
  
- === Current Status ===
+ == Current Status ==
  Hudi is a stable project used in production at Uber since 2016 and was open sourced under
the Apache License, Version 2.0 in 2017. At Uber, Hudi manages 4000+ tables holding several
petabytes, bringing our Hadoop warehouse from several hours of data delays to under 30 minutes,
over the past two years. The source code is currently hosted at github.com (https://github.com/uber/hudi
), which will seed the Apache git repository.
  
+ === Meritocracy ===
- 
-  * '''Meritocracy:'''
  
  We are fully committed to open, transparent, & meritocratic interactions with our community.
In fact, one of the primary motivations for us to enter the incubation process is to be able
to rely on Apache best practices that can ensure meritocracy. This will eventually help incorporate
the best ideas back into the project & enable contributors to continue investing their
time in the project. Current guidelines (https://uber.github.io/hudi/community.html#becoming-a-committer)
have already put in place a meritocratic process which we will replace with Apache guidelines
during incubation.
  
-  * '''Community:'''
+ === Community ===
+ 
  Hudi community is fairly young, since the project was open sourced only in early 2017. Currently,
Hudi has committers from Uber & Snowflake. We have a vibrant set of contributors (~46
members in our slack channel) including Shopify, DoubleVerify and Vungle & others, who
have either submitted patches or filed issues with hudi pipelines either in early production
or testing stages. Our primary goal during the incubation would be to grow the community and
groom our existing active contributors into committers.
  
-  * '''Core Developers:'''
+ === Core Developers ===
  
  Current core developers work at Uber & Snowflake. We are confident that incubation will
help us grow a diverse community in a open & collaborative way.
  
-  * '''Alignment:'''
+ === Alignment ===
  
  Hudi is designed as a general purpose analytical storage abstraction that integrates with
multiple Apache projects: Apache Spark, Apache Hive, Apache Hadoop. It was built using multiple
Apache projects, including Apache Parquet and Apache Avro, that support near-real time analytics
right on top of existing Apache Hadoop data lakes. Our sincere hope is that being a part of
the Apache foundation would enable us to drive the future of the project in alignment with
the other Apache projects for the benefit of thousands of organizations that already leverage
these projects.
  
- === Known Risks ===
+ == Known Risks ==
  
-  * '''Orphaned products''':
+ === Orphaned products ===
  
  The risk of abandonment of Hudi is low. It is used in production at Uber for petabytes of
data and other companies (mentioned in community section) are either evaluating or in the
early stage for production use. Uber is committed to further development of the project and
invest resources towards the Apache processes & building the community, during incubation
period.
  
-  * '''Inexperience with Open Source:'''
+ === Inexperience with Open Source ===
  
  Even though the initial committers are new to the Apache world, some have considerable open
source experience - Vinoth Chandar (Linkedin voldemort, Chromium), Prasanna Rajaperumal (Cloudera
experience), Zeeshan Qureshi (Chromium) & Balaji Varadarajan (Linkedin Databus). We have
been successfully managing the current open source community answering questions and taking
feedback already. Moreover, we hope to obtain guidance and mentorship from current ASF members
to help us succeed with the incubation. 
  
-  * '''Length of Incubation:'''
+ === Length of Incubation ===
  
  We expect the project be in incubation for 2 years or less. 
  
-  * '''Homogenous Developers:'''
+ === Homogenous Developers ===
  
  Currently, the lead developers for Hudi are from Uber. However, we have an active set of
early contributors/collaborators from Shopify, DoubleVerify and Vungle, that we hope will
increase the diversity going forward. Once again, a primary motivation for incubation is to
facilitate this in the Apache way.
  
-  * '''Reliance on Salaried Developers''':
+ === Reliance on Salaried Developers ===
  
  Both the current committers & early contributors have several years of core expertise
around data systems. Current committers are very passionate about the project and have already
invested hundreds of hours towards helping & building the community. Thus, even with employer
changes, we expect they will be able to actively engage in the project either because they
will be working in similar areas even with newer employers or out of belief in the project.
  
-  * '''Relationships with Other Apache Products:'''
+ === Relationships with Other Apache Products ===
  
  To the best of our knowledge, there are no direct competing projects with Hudi that offer
all of the feature set namely - upserts, incremental streams, efficient storage/file management,
snapshot isolation/rollbacks - in a coherent way. However, some projects share common goals
and technical elements and we will highlight them here. Hive ACID/Kudu both offer upsert capabilities
without storage management/incremental streams. The recent Iceberg project offers similar
snapshot isolation/rollbacks, but not upserts or other data plane features. A detailed comparison
with their trade-offs can be found at https://uber.github.io/hudi/comparison.html.
  
  We are committed to open collaboration with such Apache projects and incorporate changes
to Hudi or contribute patches to other projects, with the goal of making it easier for the
community at large, to adopt these open source technologies.
  
-  * '''A Excessive Fascination with the Apache Brand:'''
+ === Excessive Fascination with the Apache Brand ===
  
  This proposal is not for the purpose of generating publicity. We have already been doing
talks/meetups independently that have helped us build our community. We are drawn towards
Apache as a potential way of ensuring that our open source community management is successful
early on so  hudi can evolve into a broadly accepted--and used--method of managing data on
Hadoop.
  
- = Documentation =
+ == Documentation ==
  [1] Detailed documentation can be found at https://uber.github.io/hudi/ 
  
  

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org


Mime
View raw message