hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] pratyakshsharma edited a comment on issue #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer
Date Wed, 04 Mar 2020 09:18:01 GMT
pratyakshsharma edited a comment on issue #1362: HUDI-644 Enable user to get checkpoint from
previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-594409237
 
 
   Let me share my viewpoint on this. When I was adopting Hudi, I kept my existing pipeline
writing to one path and started DeltaStreamer writing to a different path. I then ran validation
every day for a period of time to gain enough confidence in the framework before switching to
Hudi completely.
   
   Coming to your point about switching from Kafka -> HDFS raw parquet -> Hudi table to
Kafka -> Hudi table: I was thinking about a similar use case some time back, and the simplest
thing I could think of was to support per-source checkpoints for a Hudi dataset. Currently
we store the checkpoint under "deltastreamer.checkpoint.key" in the .commit file, and its value
is in a source-specific format, which creates problems when you try to switch sources for the
same dataset. If we introduce more variables like this, each storing the checkpoint for its
corresponding source, this use case can be solved with minimal effort. And yes, this needs a
development cycle, since what I am proposing is not supported as of now. WDYT?
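
   To make the proposal concrete, here is a rough sketch of what per-source checkpoint keys
could look like. Nothing here exists in Hudi today: the `SourceCheckpoints` class, the
`<key>.<sourceName>` naming scheme, and the use of a plain map to stand in for commit metadata
are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Hudi. Commit metadata is modeled as a plain map.
class SourceCheckpoints {
    // Existing key mentioned in the comment above.
    static final String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";

    // Assumed naming scheme for the proposed per-source keys.
    static String keyFor(String sourceName) {
        return CHECKPOINT_KEY + "." + sourceName;
    }

    // Record the checkpoint for one source without touching the other sources' entries.
    static void put(Map<String, String> commitMetadata, String sourceName, String checkpoint) {
        commitMetadata.put(keyFor(sourceName), checkpoint);
    }

    // Look up the last checkpoint recorded for the given source, if any.
    static Optional<String> get(Map<String, String> commitMetadata, String sourceName) {
        return Optional.ofNullable(commitMetadata.get(keyFor(sourceName)));
    }
}
```

   With something like this, a dataset that first ingests from a DFS source and later from
Kafka would keep both checkpoints side by side, each in its own format, provided the entries
are carried forward from one commit to the next.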
   
   Currently, to handle such scenarios, "deltastreamer.checkpoint.reset_key" is configurable
for every DeltaStreamer run, and you can work around the problem using these two variables ("deltastreamer.checkpoint.key"
and "deltastreamer.checkpoint.reset_key"), but the clean solution is what I proposed above.
It also works well when you want to switch sources frequently.
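
   For reference, the workaround with the two existing keys can be sketched roughly as below.
This is a simplified model, not the actual DeltaStreamer code: the `CheckpointResolver` class
is hypothetical, and the rule it encodes (a user-supplied checkpoint is honored only when it
differs from the reset key stored in the previous commit) is my assumption about how the two
keys interact.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of checkpoint resolution, not the real DeltaStreamer implementation.
class CheckpointResolver {
    static final String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";
    static final String CHECKPOINT_RESET_KEY = "deltastreamer.checkpoint.reset_key";

    // lastCommitMetadata: metadata of the previous commit, modeled as a plain map.
    // configuredCheckpoint: checkpoint passed for this run, or null if none was given.
    static Optional<String> resolve(Map<String, String> lastCommitMetadata,
                                    String configuredCheckpoint) {
        // Assumed rule: a new override wins over the stored checkpoint exactly once,
        // that is, only while it differs from the reset key already recorded.
        if (configuredCheckpoint != null
                && !configuredCheckpoint.equals(lastCommitMetadata.get(CHECKPOINT_RESET_KEY))) {
            return Optional.of(configuredCheckpoint);
        }
        // Otherwise resume from the checkpoint written by the previous commit, if any.
        return Optional.ofNullable(lastCommitMetadata.get(CHECKPOINT_KEY));
    }
}
```

   Switching sources this way means passing a checkpoint in the new source's format on the
first run after the switch, which is why it feels like a hack compared to per-source keys.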
   
   I would also like to hear from @leesf and @vinothchandar on this.

