incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "HudiProposal" by ThomasWeise
Date Sun, 13 Jan 2019 21:52:33 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "HudiProposal" page has been changed by ThomasWeise:
https://wiki.apache.org/incubator/HudiProposal?action=diff&rev1=7&rev2=8

+ = Hudi Proposal =
+ 
  == Abstract ==
+ 
  Hudi is a big-data storage library that provides atomic upserts and incremental data streams.
  
  Hudi manages data stored in Apache Hadoop and other API-compatible distributed file systems and cloud stores.
  
- = Proposal =
+ == Proposal ==
  
  Hudi provides the ability to atomically upsert datasets with new values in near-real time, making data available quickly to existing query engines such as Apache Hive, Apache Spark, and Presto. Additionally, Hudi provides a sequence of changes to a dataset from a given point in time, enabling incremental data pipelines that are more efficient and have lower latency than their typical batch counterparts. By carefully managing the number and size of files, Hudi greatly aids both query engines (e.g., by always providing well-sized files) and the underlying storage (e.g., by reducing HDFS NameNode memory consumption).
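
  As a rough illustration of the two capabilities described above, the toy Python sketch below models a dataset keyed by record key, where each upsert batch is applied as a single commit and an incremental read returns only the records changed after a given commit. This is not Hudi's API or storage layout; the class ToyDataset, its methods, and the integer commit times are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """In-memory stand-in for a keyed dataset (hypothetical, not Hudi's API)."""
    records: dict = field(default_factory=dict)  # key -> (commit_time, value)
    last_commit: int = 0

    def upsert(self, batch: dict) -> int:
        """Apply a batch as one commit: insert new keys, overwrite existing ones."""
        commit_time = self.last_commit + 1
        staged = dict(self.records)  # stage the changes on a copy
        for key, value in batch.items():
            staged[key] = (commit_time, value)
        # Swap in the staged state in one step, so a reader sees all or nothing.
        self.records, self.last_commit = staged, commit_time
        return commit_time

    def snapshot(self) -> dict:
        """Latest value for every key."""
        return {k: v for k, (_, v) in self.records.items()}

    def incremental(self, since: int) -> dict:
        """Records changed by commits made after `since`."""
        return {k: v for k, (t, v) in self.records.items() if t > since}


ds = ToyDataset()
c1 = ds.upsert({"a": 1, "b": 2})
c2 = ds.upsert({"b": 20, "c": 3})
print(ds.snapshot())             # full view after both commits
print(ds.incremental(since=c1))  # only what changed after the first commit
```

  A downstream pipeline built on this idea would remember the last commit time it processed and pass it as `since` on the next run, so it handles only the changed records instead of rescanning the whole dataset.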
  
  Hudi is largely implemented as an Apache Spark library that reads and writes data from and to a Hadoop-compatible filesystem. SQL queries on Hudi datasets are supported via specialized Apache Hadoop input formats that understand Hudi’s storage layout. Currently, Hudi manages datasets using a combination of the Apache Parquet and Apache Avro file/serialization formats.
- 
  
  == Background ==
  
