hudi-commits mailing list archives

From "lamber-ken (Jira)" <>
Subject [jira] [Commented] (HUDI-635) MergeHandle's DiskBasedMap entries can be thinner
Date Thu, 27 Feb 2020 05:09:00 GMT


lamber-ken commented on HUDI-635:

hi [~vinoth], here are my initial thoughts; they may not be correct.

Can we try replacing RDD<HoodieRecord> with Dataset<Row>? RDDs have no built-in
optimization engine, so when working with structured data they cannot take advantage of Spark's
advanced optimizers (Catalyst and Tungsten).
 * When upserting data into Hudi, records are converted to Avro; these many conversion steps may add cost.
 * DataFrame supports adding columns, so the additional meta columns (_hoodie_commit_time, _hoodie_commit_seqno, ...) could be attached directly.

Also, we could expose Row to users (instead of GenericRecord) in the payload; users can then
use methods like getString, getDate, etc., which are friendlier.

With GenericRecord, users need to care about the schema and convert data from bytes themselves.
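To illustrate the ergonomics argument, here is a minimal plain-Java sketch (no Spark or Avro dependency). The SimpleRow class below is hypothetical; it only mimics the typed-accessor style of Spark's Row (getString, getInt, ...) to show why callers would no longer need to handle the schema or decode bytes themselves, as they do with GenericRecord.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Spark's Row: field values are looked up by name
// and returned through typed getters, so the caller never touches the schema.
public class RowSketch {
    static class SimpleRow {
        private final Map<String, Object> fields = new HashMap<>();
        void put(String name, Object value) { fields.put(name, value); }
        String getString(String name) { return (String) fields.get(name); }
        int getInt(String name) { return (Integer) fields.get(name); }
    }

    public static void main(String[] args) {
        SimpleRow row = new SimpleRow();
        // Field names below are illustrative sample data.
        row.put("_hoodie_commit_time", "20200227050900");
        row.put("rider", "rider-A");
        row.put("fare", 27);

        // Typed access, no schema lookup or byte decoding by the caller:
        System.out.println(row.getString("rider")); // prints rider-A
        System.out.println(row.getInt("fare"));     // prints 27
    }
}
```

With a GenericRecord, the same lookups would require the caller to consult the Avro schema and convert the raw values themselves.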

WDYT? :)


> MergeHandle's DiskBasedMap entries can be thinner
> -------------------------------------------------
>                 Key: HUDI-635
>                 URL:
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
> Instead of <Key, HoodieRecord>, we can just track <Key, Payload> ... Helps
> with use-cases like HUDI-625

This message was sent by Atlassian Jira
