hudi-users mailing list archives

From leesf <>
Subject Re: Apache Hudi 0.7.0 Released
Date Wed, 27 Jan 2021 10:59:23 GMT
Thanks, Vinoth, for driving this major release, and thanks to everyone involved.

On Wed, Jan 27, 2021 at 6:33 AM, Vinoth Chandar <> wrote:

> Hello all,
> We are excited to share that the 0.7.0 release is out. It is by far our
> biggest release, with lots of code moving around, new unique features and
> bug fixes.
> Please find more information here and provide feedback.
> A few quick highlights:
> *Clustering*: 0.7.0 brings the ability to cluster your Hudi tables, to
> optimize for file sizes and also storage layout. Hudi will continue to
> enforce file sizes during the write, as it always has. Clustering provides
> more flexibility to increase file sizes down the line, or to ingest data
> at much fresher intervals and later coalesce it into bigger files. This is
> very similar to the benefits of clustering delivered by cloud data
> warehouses. We are proud to announce that such a capability is freely
> available in open source, for the first time, through the 0.7.0 release.
> *Metadata Table*: 0.7.0 lays the foundation for storing more indexes and
> metadata in an internal metadata table, which is implemented as a Hudi MOR
> table. This means it is compacted, cleaned and also incrementally updated
> like any other Hudi table. Setting hoodie.metadata.enable=true on the
> writer side populates the metadata table with file system listings, so
> operations no longer have to call fs.listStatus() explicitly on data
> partitions. In our testing, on a large 250K file table, the metadata table
> delivers a 2-3x speedup over the parallelized listing done by the Hudi
> Spark writer.
> Users can also leverage the metadata table on the query side for the
> following query paths. For Hive, set the hoodie.metadata.enable=true
> session property; for SparkSQL on Hive registered tables, pass --conf
> spark.hadoop.hoodie.metadata.enable=true. Either setting allows the file
> listings for a partition to be fetched from the metadata table instead of
> listing the underlying DFS partition. More engines are coming.
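
For anyone wiring up the metadata table settings quoted above, here is a minimal sketch in plain Python that collects the three places the flag is set. The property names come from the announcement; the helper names (writer_options, spark_sql_conf_args, hive_session_statement) and the example table name are hypothetical, and this is only an illustration, not how Hudi itself applies the setting.

```python
# Property name taken from the announcement.
METADATA_KEY = "hoodie.metadata.enable"


def _flag(enabled):
    """Render a boolean the way the property is written in the announcement."""
    return "true" if enabled else "false"


def writer_options(base_options, enable_metadata=True):
    """Return a copy of the writer options with the metadata table toggled.

    Hypothetical helper: the input dict is left unmodified.
    """
    opts = dict(base_options)
    opts[METADATA_KEY] = _flag(enable_metadata)
    return opts


def spark_sql_conf_args(enable_metadata=True):
    """Build the --conf argument pair for SparkSQL on Hive registered tables."""
    return ["--conf", f"spark.hadoop.{METADATA_KEY}={_flag(enable_metadata)}"]


def hive_session_statement(enable_metadata=True):
    """Build a Hive session statement that sets the property."""
    return f"set {METADATA_KEY}={_flag(enable_metadata)};"


if __name__ == "__main__":
    # "trips" is a made-up table name used only for illustration.
    print(writer_options({"hoodie.table.name": "trips"}))
    print(" ".join(spark_sql_conf_args()))
    print(hive_session_statement())
```

The writer-side dict would be passed along with the rest of the write options, while the other two helpers only format strings for the query side.
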
> *Java/Flink Writers*: In 0.7.0, we have additionally added Java and Flink
> based writers, as initial steps. Specifically, the HoodieFlinkStreamer
> allows a Hudi Copy-On-Write table to be built by streaming data from a
> Kafka topic.
> *Spark3 Support*: We have added support for writing/querying data using
> Spark 3. Please be sure to use the Scala 2.12 hudi-spark-bundle.
> *Insert Overwrite/Insert Overwrite Table*: We have added these two new
> write operation types, predominantly to help existing batch ETL jobs, which
> typically overwrite entire tables/partitions each run. These operations are
> much cheaper than having to issue upserts, given they are bulk replacing
> the target table. Check here for examples.
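
A rough sketch of how the two new operation types might be selected from a Spark datasource job. The announcement does not spell out the option names, so the key hoodie.datasource.write.operation and the values insert_overwrite and insert_overwrite_table are my understanding of the datasource write options and should be checked against the docs; the helper name overwrite_options is hypothetical.

```python
# Assumed option key and values (not stated in the announcement).
OPERATION_KEY = "hoodie.datasource.write.operation"
INSERT_OVERWRITE = "insert_overwrite"              # replaces the partitions touched by the batch
INSERT_OVERWRITE_TABLE = "insert_overwrite_table"  # replaces the entire table


def overwrite_options(base_options, whole_table=False):
    """Return a copy of the write options with an overwrite operation selected.

    Hypothetical helper: whole_table=True picks the table-level operation,
    otherwise the partition-level one is used.
    """
    opts = dict(base_options)
    opts[OPERATION_KEY] = INSERT_OVERWRITE_TABLE if whole_table else INSERT_OVERWRITE
    return opts


if __name__ == "__main__":
    # "trips" is a made-up table name used only for illustration.
    opts = overwrite_options({"hoodie.table.name": "trips"})
    print(opts)
    # With a SparkSession and DataFrame at hand, the options would be passed
    # roughly like this (left as a comment so the sketch runs without Spark):
    #   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```
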
> *Incremental Query on MOR (Spark Datasource)*: The Spark datasource now has
> experimental support for incremental queries on MOR tables. This feature
> will be hardened and certified in the next release, along with a large
> overhaul of the Spark datasource implementation. (sshh!:))
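
For the incremental query path, a small sketch of the read options one would typically assemble. The option keys hoodie.datasource.query.type, hoodie.datasource.read.begin.instanttime and hoodie.datasource.read.end.instanttime are not named in the announcement; they are my understanding of the datasource read options. The helper name incremental_read_options and the instant values are hypothetical.

```python
# Assumed option keys (not stated in the announcement).
QUERY_TYPE_KEY = "hoodie.datasource.query.type"
BEGIN_INSTANT_KEY = "hoodie.datasource.read.begin.instanttime"
END_INSTANT_KEY = "hoodie.datasource.read.end.instanttime"


def incremental_read_options(begin_instant, end_instant=None):
    """Build read options for an incremental query.

    Hypothetical helper: begin_instant is required, end_instant is optional
    and is only added to the options when provided.
    """
    if not begin_instant:
        raise ValueError("begin_instant must be a non-empty commit time string")
    opts = {QUERY_TYPE_KEY: "incremental", BEGIN_INSTANT_KEY: begin_instant}
    if end_instant is not None:
        opts[END_INSTANT_KEY] = end_instant
    return opts


if __name__ == "__main__":
    # The instant value is a made-up commit time used only for illustration.
    opts = incremental_read_options("20210126000000")
    print(opts)
    # With a SparkSession at hand (left as a comment so the sketch runs without Spark):
    #   spark.read.format("hudi").options(**opts).load(base_path)
```

Since the feature is described as experimental in this release, results on MOR tables are worth validating against a snapshot query before relying on them.
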
> Thanks,
> Vinoth
> (on behalf of the Hudi Community)
