Subject: [GitHub] [incubator-hudi] garyli1019 commented on issue #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer
From: GitBox
To: commits@hudi.apache.org
Date: Mon, 02 Mar 2020 19:38:46 -0000

garyli1019 commented on issue #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-593579929

@pratyakshsharma Thanks for reviewing this PR. I can say more about my use case:
- I am using Kafka Connect to sink Kafka to HDFS every 30 minutes, partitioned by the arrival time (year=xx/month=xx/day=xx/hour=xx).
- The homebrew Spark data source I built uses (the current time - the last Hudi commit timestamp) to find a `timeWindow` and uses it to load the data generated by Kafka Connect.
- It would be easy to switch to the DeltaStreamer with the DFS source, but a little bit tricky to switch to Kafka, because there is a delay introduced by Kafka Connect. So right now, if I switched to the DeltaStreamer ingesting directly from Kafka, I would have to start from the `LATEST` checkpoint, and `EARLIEST` is not possible because my Kafka cluster's retention time is pretty long. The data loss I mentioned is the few-hours gap around the commit where I switch from my homebrew Spark data source reader to the DeltaStreamer. All I need to do at that commit is run the DeltaStreamer first to store the `LATEST` checkpoint, then run my data source reader to read the data in that few-hours gap. I only need one parallel run here, and then I will be good to go with the DeltaStreamer.
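The timeWindow mechanism described in the second bullet can be sketched roughly like this. This is a minimal illustration, not Hudi code: the function name `hourly_partitions` and the idea of enumerating arrival-time partition paths from the last commit timestamp to now are assumptions I am making about what such a reader would do.

```python
from datetime import datetime, timedelta

def hourly_partitions(last_commit_ts: datetime, now: datetime):
    """Enumerate the year=/month=/day=/hour= partition paths that cover
    the window between the last Hudi commit and the current time.
    (Hypothetical helper, illustrating the homebrew reader's timeWindow.)"""
    paths = []
    # Truncate to the start of the hour containing the last commit,
    # since Kafka Connect writes whole hourly partitions.
    t = last_commit_ts.replace(minute=0, second=0, microsecond=0)
    while t <= now:
        paths.append(
            f"year={t.year}/month={t.month:02d}/day={t.day:02d}/hour={t.hour:02d}"
        )
        t += timedelta(hours=1)
    return paths

# Example: last commit at 17:45, current time 19:38 on the same day.
window = hourly_partitions(
    datetime(2020, 3, 2, 17, 45), datetime(2020, 3, 2, 19, 38)
)
# Covers hours 17, 18, and 19 of 2020-03-02.
```

A real reader would then pass these paths to `spark.read.load(...)` against the Kafka Connect output directory; that part is omitted here.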
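The one-time switchover order matters: the DeltaStreamer must run first so its `LATEST` Kafka checkpoint is recorded, and only then does the homebrew reader backfill the gap up to that moment. A sketch of that sequencing, with hypothetical callables standing in for the two jobs (these are not Hudi or Spark APIs, just placeholders to show the ordering):

```python
from datetime import datetime, timezone

def switchover(run_delta_streamer, run_homebrew_reader, last_commit_ts):
    """One-time switch: run the DeltaStreamer first so it records the
    LATEST Kafka checkpoint, then backfill the gap between the last
    homebrew commit and the switch time with the old reader.
    Both arguments are hypothetical job launchers."""
    switch_ts = datetime.now(timezone.utc)
    run_delta_streamer()  # stores the LATEST checkpoint in the commit metadata
    run_homebrew_reader(last_commit_ts, switch_ts)  # fills the gap exactly once
    return switch_ts
```

After this single parallel run, the DeltaStreamer can take over on its own, resuming from the checkpoint it stored.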