incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "HudiProposal" by ThomasWeise
Date Sun, 13 Jan 2019 21:52:33 GMT
The "HudiProposal" page has been changed by ThomasWeise:

+ = Hudi Proposal =
  == Abstract ==
  Hudi is a big-data storage library that provides atomic upserts and incremental data streams.
  Hudi manages data stored in Apache Hadoop and other API compatible distributed file systems/cloud
- = Proposal =
+ == Proposal ==
  Hudi provides the ability to atomically upsert datasets with new values in near-real time,
making data available quickly to existing query engines like Apache Hive, Apache Spark, and
Presto. Additionally, Hudi provides a sequence of changes to a dataset from a given point in time
to enable incremental data pipelines that yield greater efficiency and lower latency than their
typical batch counterparts. By carefully managing the number and size of files, Hudi greatly
aids both query engines (e.g., by always providing well-sized files) and the underlying storage (e.g.,
HDFS NameNode memory consumption).
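
The two ideas above, upserting by key and reading only the changes since a given point in time, can be illustrated with a small in-memory model. This is a conceptual sketch only, not Hudi's API: the `ToyDataset` class, its `upsert` and `changes_since` methods, and the integer commit times are all hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyDataset:
    """Hypothetical in-memory stand-in for a keyed dataset with a commit timeline."""

    records: Dict[str, dict] = field(default_factory=dict)
    # Each entry is (commit_time, {key: value}), kept in commit order.
    timeline: List[Tuple[int, Dict[str, dict]]] = field(default_factory=list)

    def upsert(self, commit_time: int, batch: Dict[str, dict]) -> None:
        """Insert new keys and overwrite existing ones as one commit."""
        if self.timeline and commit_time <= self.timeline[-1][0]:
            raise ValueError("commit times must be strictly increasing")
        # Build the merged state separately, then rebind it in one step,
        # so the visible state is either fully old or fully new.
        merged = {**self.records, **batch}
        self.records = merged
        self.timeline.append((commit_time, dict(batch)))

    def changes_since(self, commit_time: int) -> Dict[str, dict]:
        """Return the latest value of every key written after commit_time."""
        changed: Dict[str, dict] = {}
        for ts, batch in self.timeline:
            if ts > commit_time:
                changed.update(batch)
        return changed


ds = ToyDataset()
ds.upsert(1, {"a": {"v": 1}, "b": {"v": 2}})
ds.upsert(2, {"b": {"v": 20}, "c": {"v": 3}})

# A downstream job that last ran at commit 1 only needs the later changes.
incremental = ds.changes_since(1)
```

A downstream pipeline built this way processes only the keys touched since its last run instead of rescanning the whole dataset, which is where the efficiency and latency gain over a full batch job would come from.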
  Hudi is largely implemented as an Apache Spark library that reads and writes data from and to a
Hadoop-compatible filesystem. SQL queries on Hudi datasets are supported via specialized Apache Hadoop
input formats that understand Hudi’s storage layout. Currently, Hudi manages datasets using
a combination of the Apache Parquet and Apache Avro file/serialization formats.
  == Background ==
