From: GitBox
Reply-To: dev@hudi.apache.org
To: commits@hudi.apache.org
Subject: [GitHub] [incubator-hudi] pratyakshsharma commented on issue #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer
Date: Wed, 04 Mar 2020 09:17:45 -0000
Message-ID: <158331346584.4169.6552583421694703519.gitbox@gitbox.apache.org>

URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-594409237

Let me put forward my viewpoint on this. When I was in the phase of adopting Hudi, I kept my already-running pipeline writing to one path and started DeltaStreamer writing to a different path. I then ran validation every day for some period of time to gain enough confidence in the framework before switching to Hudi completely.

Coming to your point about switching from Kafka -> HDFS raw parquet -> Hudi table to Kafka -> Hudi table: I was thinking about a similar use case some time back, and the simplest thing I could think of was to support source-wise checkpoints for a Hudi dataset. Currently we store the checkpoint under "deltastreamer.checkpoint.key" in the .commit file, and this variable holds the checkpoint in a source-specific format, which creates problems when you try to switch the source for the same dataset. So I think that if we simply introduce more variables like this one, each storing the checkpoint for its corresponding source, this use case can be solved with minimal effort. And yes, this needs a development cycle, since what I am proposing is not supported as of now. WDYT?

Currently, to handle such scenarios, we have "deltastreamer.checkpoint.reset_key", which is configurable for every DeltaStreamer run. You can do hacks around these two variables ("deltastreamer.checkpoint.key" and "deltastreamer.checkpoint.reset_key") to solve your use case, but a clean solution would be what I proposed above. That solution also works well when you want to switch sources quite frequently.

I would also like to hear from @leesf and @vinothchandar on this.
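
To make the per-source checkpoint proposal above concrete, here is a rough sketch in Python. It is only an illustration of the idea, not Hudi code: the per-source key naming scheme (`deltastreamer.checkpoint.key.<source>`), the helper names, and the example checkpoint strings are all assumptions made up for this sketch, and commit metadata is modelled as a plain dict.

```python
from typing import Dict, Optional

# Key named in the comment above; today it holds a single checkpoint.
CHECKPOINT_KEY = "deltastreamer.checkpoint.key"


def source_checkpoint_key(source_type: str) -> str:
    """Hypothetical per-source key, e.g. 'deltastreamer.checkpoint.key.kafka'."""
    return f"{CHECKPOINT_KEY}.{source_type}"


def write_checkpoint(prev_metadata: Dict[str, str],
                     source_type: str,
                     checkpoint: str) -> Dict[str, str]:
    """Build metadata for a new commit.

    Per-source checkpoints from the previous commit are carried forward,
    so switching sources does not lose the other sources' positions.
    """
    prefix = CHECKPOINT_KEY + "."
    new_metadata = {k: v for k, v in prev_metadata.items() if k.startswith(prefix)}
    new_metadata[source_checkpoint_key(source_type)] = checkpoint
    return new_metadata


def resume_checkpoint(metadata: Dict[str, str], source_type: str) -> Optional[str]:
    """Return the last checkpoint for this source, or None if it never ran."""
    return metadata.get(source_checkpoint_key(source_type))


# Illustrative values only: one run from a DFS source, then a switch to Kafka.
meta: Dict[str, str] = {}
meta = write_checkpoint(meta, "dfs", "1583313465000")
meta = write_checkpoint(meta, "kafka", "events,0:100,1:250")
```

With this layout, switching back to the DFS source later would resume from `resume_checkpoint(meta, "dfs")` without any manual reset, while a source that has never run returns `None` and would start from its default position.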