hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] pratyakshsharma commented on issue #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer
Date Tue, 03 Mar 2020 07:48:43 GMT
pratyakshsharma commented on issue #1362: HUDI-644 Enable user to get checkpoint from previous
commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-593813910
 
 
   @garyli1019 I still feel all these challenges arise because you are trying to ingest
data into the same dataset using two different Spark jobs. A few questions:
   
   1. If the Kafka cluster retention time is too long, have you tried using Hudi's BULK_INSERT
mode? If not, you can tune the Spark and Hudi parameters to increase the source limit and
then ingest the data. Alternatively, you can try running DeltaStreamer in continuous mode.
   2. I would also like to know why you switch from your homebrew Spark job to Hudi every
time. Are you doing a POC on Hudi? Why not simply use DeltaStreamer and never switch to the
other data source? The data loss will not happen if you rely on just one of the data
sources :)
   
   I am a bit skeptical about using two pipelines to write to the same destination path.
Additionally, we have options available for backing up your Hudi dataset or for migrating an
existing dataset to Hudi. Anyway, if you strongly feel the need to write this
checkPointGenerator, let us hear the opinions of @leesf and @vinothchandar on this before
proceeding.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
