hudi-commits mailing list archives

From "Vinoth Chandar (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (HUDI-538) Restructuring hudi client module for multi engine support
Date Sat, 18 Jan 2020 18:06:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018664#comment-17018664 ]

Vinoth Chandar edited comment on HUDI-538 at 1/18/20 6:05 PM:
--------------------------------------------------------------

+1 [~yanghua], I added a second task for moving classes around based on your changes.

The core issue we need a solution for, IMO, is the following (if we solve this, the rest is
more or less easy). I will illustrate using Spark, since my understanding of Flink is somewhat
limited at the moment.

 

So, even for Spark, I would like writing to be supported via both the _RDD_ and _DataFrame_
routes, but the current code converts the DataFrame into RDDs to perform writes. This has some
performance side effects (surprisingly, :P).

 

1) If you take a single class like _HoodieWriteClient_, it currently does something like
`hoodieRecordRDD.map().sort()` internally. If we want to support a Flink DataStream or a Spark
DataFrame as the underlying object, we need to define an abstraction like `HoodieExecutionContext<T>`
which has a common set of map(T) -> T, sortBy(T) -> T, filter(), and repartition()
methods. Subclasses like _HoodieSparkRDDExecutionContext<JavaRDD>,_ _HoodieSparkDataFrameExecutionContext<DataFrame>_,
and _HoodieFlinkDataStreamExecutionContext<DataStream>_ would implement them in engine-specific
ways and hand back the transformed T object, along the lines of the sketch below.
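
To make this concrete: a minimal sketch, assuming the method set named above (map, sortBy, filter, repartition). Only `HoodieExecutionContext` and the subclass name come from this discussion; the `SerializableFunction` helper and every signature here are hypothetical, not existing Hudi APIs.

{code:java}
// --- HoodieExecutionContext.java (hypothetical) ---
import java.io.Serializable;
import java.util.function.Function;

import org.apache.hudi.common.model.HoodieRecord;

public interface HoodieExecutionContext<T> {

  // Engines like Spark ship functions to executors, so they must serialize.
  interface SerializableFunction<I, O> extends Function<I, O>, Serializable {}

  // Common record-level operations; each hands back the engine-specific T.
  // Raw HoodieRecord is used here for brevity.
  T map(T input, SerializableFunction<HoodieRecord, HoodieRecord> fn);

  T sortBy(T input, SerializableFunction<HoodieRecord, String> sortKey);

  T filter(T input, SerializableFunction<HoodieRecord, Boolean> predicate);

  T repartition(T input, int numPartitions);
}

// --- HoodieSparkRDDExecutionContext.java (hypothetical) ---
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.spark.api.java.JavaRDD;

public class HoodieSparkRDDExecutionContext
    implements HoodieExecutionContext<JavaRDD<HoodieRecord>> {

  @Override
  public JavaRDD<HoodieRecord> map(JavaRDD<HoodieRecord> input,
      SerializableFunction<HoodieRecord, HoodieRecord> fn) {
    return input.map(fn::apply); // delegates to Spark's RDD map
  }

  @Override
  public JavaRDD<HoodieRecord> sortBy(JavaRDD<HoodieRecord> input,
      SerializableFunction<HoodieRecord, String> sortKey) {
    return input.sortBy(sortKey::apply, true, input.getNumPartitions());
  }

  @Override
  public JavaRDD<HoodieRecord> filter(JavaRDD<HoodieRecord> input,
      SerializableFunction<HoodieRecord, Boolean> predicate) {
    return input.filter(predicate::apply);
  }

  @Override
  public JavaRDD<HoodieRecord> repartition(JavaRDD<HoodieRecord> input,
      int numPartitions) {
    return input.repartition(numPartitions);
  }
}
{code}

_HoodieWriteClient_ could then be written purely against `HoodieExecutionContext<T>`, so a `hoodieRecordRDD.map().sort()` call becomes something like `context.sortBy(context.map(input, fn), key)`, with the engine choice made only at construction time.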

 

2) Right now, we work with _HoodieRecord_ as the record-level abstraction, i.e., we eagerly
parse the input into a HoodieKey (String recordKey, String partitionPath) and a HoodieRecordPayload.
The key is needed during indexing, and the payload is needed to precombine duplicates within
a batch (maybe Spark-specific) and to combine the incoming record with what is stored in the table
during writing. We need a way to do these lazily, by pushing the key extraction function into the
entire writing path; one possible shape is sketched below.
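
For illustration only, a sketch of what the lazy route could look like. `LazyHoodieRecord` and `KeyExtractor` are hypothetical names invented here; only HoodieKey is an existing class.

{code:java}
// Hypothetical sketch -- LazyHoodieRecord and KeyExtractor are illustrative
// names, not existing Hudi classes; only HoodieKey exists today.
import java.io.Serializable;

import org.apache.hudi.common.model.HoodieKey;

public class LazyHoodieRecord<I> implements Serializable {

  // Pushed through the write path; invoked only where the key is needed
  // (e.g. indexing, precombine) instead of eagerly at ingest time.
  public interface KeyExtractor<I> extends Serializable {
    HoodieKey extractKey(I input);
  }

  private final I rawInput;
  private final KeyExtractor<I> keyExtractor;
  private transient HoodieKey key; // computed on first access, then cached

  public LazyHoodieRecord(I rawInput, KeyExtractor<I> keyExtractor) {
    this.rawInput = rawInput;
    this.keyExtractor = keyExtractor;
  }

  public HoodieKey getKey() {
    if (key == null) {
      key = keyExtractor.extractKey(rawInput); // pay the parsing cost lazily
    }
    return key;
  }

  public I getRawInput() {
    return rawInput;
  }
}
{code}

Indexing and precombine would then call getKey() exactly where they need it, rather than the ingest path materializing a HoodieKey for every record up front.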

 

I think we should think deeply about these issues and settle on concrete approaches before we
embark on the restructuring; we will hit these issues either way.


> Restructuring hudi client module for multi engine support
> ---------------------------------------------------------
>
>                 Key: HUDI-538
>                 URL: https://issues.apache.org/jira/browse/HUDI-538
>             Project: Apache Hudi (incubating)
>          Issue Type: Wish
>          Components: Code Cleanup
>            Reporter: vinoyang
>            Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework, which makes integration
with other computing engines more difficult. We plan to decouple it from Spark. This umbrella
issue is used to track that work.
> Some thoughts are written up here: https://docs.google.com/document/d/1Q9w_4K6xzGbUrtTS0gAlzNYOmRXjzNUdbbe0q59PX9w/edit?usp=sharing
> The feature branch is {{restructure-hudi-client}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
