hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] garyli1019 commented on issue #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer
Date Tue, 03 Mar 2020 18:24:23 GMT
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-594096377
 
 
   @pratyakshsharma Let's set aside my homebrew Spark data source reader. Assume I am running
DeltaStreamer against a DFS source and would now like to switch to DeltaStreamer against a
Kafka source. Data arrives at DFS and Kafka asynchronously: the DFS source lags Kafka by 30
minutes.
   So basically I'd like to switch from **Kafka -> HDFS raw parquet -> Hudi table**
to **Kafka -> Hudi table**. If you have a good solution for this case, please let me know.

   
   - The problem is that my Kafka retention time is long, but not long enough to cover
all the data. All of my raw data is in DFS, and more keeps coming in. If I simply do a
BULK_INSERT from the EARLIEST Kafka checkpoint, I will lose data. If I do the HDFS import
first and then UPSERT from:
      - the EARLIEST checkpoint, it could eat up the resources of both my Spark cluster and
my Kafka cluster, because the data volume is huge.
      - the LATEST checkpoint, I will lose data (the 30-minute gap).
   - I believe there are Hudi users who did not start out with DeltaStreamer and would like
to switch to it later. I am one of them. From a user's perspective, I won't fully trust a
framework until I fully understand it and have gained enough experience with it.
   
   Currently, I can't find a clean way to switch to DeltaStreamer, because I need to make
a non-DeltaStreamer commit to append the gap data to the Hudi dataset, and that commit will
cause me to lose the checkpoint. Let's not call this a parallel pipeline, because that is
confusing. It is a one-time fix for the data gap between two different sources; afterwards
DeltaStreamer will be the only writer to the sink.
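
   To make the gap concrete, here is a rough sketch of the bookkeeping I have in mind. It is
plain Python, not Hudi code. It assumes the raw parquet in DFS still carries each record's
Kafka partition and offset, and that the Kafka checkpoint is a string of the form
`topic,partition:offset,...`; please check that format against your Hudi version. The
function names, topic name, and offsets are all made up for illustration.

```python
from typing import Dict


def resume_offsets(last_imported: Dict[int, int]) -> Dict[int, int]:
    """Next offset to read per partition: one past the last offset already in the table."""
    return {partition: offset + 1 for partition, offset in last_imported.items()}


def format_checkpoint(topic: str, offsets: Dict[int, int]) -> str:
    """Serialize offsets as 'topic,partition:offset,...' (assumed checkpoint format)."""
    parts = [f"{partition}:{offset}" for partition, offset in sorted(offsets.items())]
    return ",".join([topic] + parts)


def records_to_replay(start: Dict[int, int], latest: Dict[int, int]) -> int:
    """How many records lie between the resume point and the current end of the topic."""
    return sum(max(latest[p] - start.get(p, 0), 0) for p in latest)


# Hypothetical numbers: the last offsets that reached the Hudi table via the DFS import.
last_imported = {0: 99, 1: 149}
# Hypothetical current end offsets of the topic (what LATEST would resolve to).
latest = {0: 130, 1: 170}

start = resume_offsets(last_imported)
checkpoint = format_checkpoint("events", start)
gap = records_to_replay(start, latest)
```

   With these numbers, starting from LATEST would skip the 50 records in `gap`, while
starting from `checkpoint` replays only those records instead of the whole retained topic.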

