hudi-commits mailing list archives

From "lamber-ken (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HUDI-635) MergeHandle's DiskBasedMap entries can be thinner
Date Thu, 27 Feb 2020 05:09:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046166#comment-17046166 ]

lamber-ken commented on HUDI-635:
---------------------------------

Hi [~vinoth], here are my initial thoughts; they may not be correct.

Can we try replacing RDD<HoodieRecord> with Dataset<Row>? RDDs don't have a built-in
optimization engine. When working with structured data, RDDs cannot take advantage of Spark's
advanced optimizers.
 * When upserting data to HUDI, records are converted to Avro; these many conversion operations
may cost more time
 * DataFrame supports adding columns, such as the additional columns (_hoodie_commit_time,
_hoodie_commit_seqno, _hoodie_record_key)

Also, we can expose Row to users (instead of GenericRecord) in the payload, so users can use
methods like getString, getDate, etc., which are more friendly.

If we use GenericRecord, users need to care about the schema and convert data from bytes.

WDYT? :)

 

> MergeHandle's DiskBasedMap entries can be thinner
> -------------------------------------------------
>
>                 Key: HUDI-635
>                 URL: https://issues.apache.org/jira/browse/HUDI-635
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>
> Instead of <Key, HoodieRecord>, we can just track <Key, Payload> ... Helps with use-cases like HUDI-625
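
To make the quoted proposal concrete, here is a minimal Java sketch of the two map shapes. It is only an illustration under assumptions: `FullRecord` and `Payload` are hypothetical stand-ins rather than Hudi's actual classes, and a plain `HashMap` stands in for the disk-based map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these types are Hudi's real classes.
final class ThinEntriesSketch {

    // Hypothetical stand-in for a payload: only the serialized record data.
    static final class Payload {
        final byte[] data;

        Payload(byte[] data) {
            this.data = data;
        }
    }

    // Hypothetical stand-in for a full record: the key is stored again
    // inside the value, next to the payload.
    static final class FullRecord {
        final String key;
        final String partitionPath;
        final Payload payload;

        FullRecord(String key, String partitionPath, Payload payload) {
            this.key = key;
            this.partitionPath = partitionPath;
            this.payload = payload;
        }
    }

    // Current shape, <Key, Record>: each value repeats the key and partition path.
    static Map<String, FullRecord> indexByRecord(List<FullRecord> records) {
        Map<String, FullRecord> map = new HashMap<>();
        for (FullRecord r : records) {
            map.put(r.key, r);
        }
        return map;
    }

    // Proposed shape, <Key, Payload>: each value carries only the payload.
    static Map<String, Payload> indexByPayload(List<FullRecord> records) {
        Map<String, Payload> map = new HashMap<>();
        for (FullRecord r : records) {
            map.put(r.key, r.payload);
        }
        return map;
    }

    public static void main(String[] args) {
        List<FullRecord> incoming = new ArrayList<>();
        incoming.add(new FullRecord("key-1", "2020/02/27", new Payload(new byte[] {1, 2, 3})));

        // The payload is still reachable by key, without the wrapper fields.
        Map<String, Payload> thin = indexByPayload(incoming);
        System.out.println(thin.get("key-1").data.length);
    }
}
```

In this sketch the thinner map drops the per-entry copy of the key and partition path. How much that would save in the real disk-based map depends on how entries are serialized, which the sketch does not model.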



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
