hudi-commits mailing list archives

From "Vinoth Chandar (Jira)" <>
Subject [jira] [Commented] (HUDI-480) Support a method for querying deleted data in the incremental view
Date Mon, 23 Mar 2020 02:33:00 GMT


Vinoth Chandar commented on HUDI-480:

This needs an RFC to drive the design first. At a high level, it seems that with some additional
cleaner retention and some merging of old and current file slices, we should be able to do
something of this sort.
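
For what it's worth, a minimal sketch of the cleaner-retention part (assuming Hudi 0.5.x
class names; `basePath` and the retention value are illustrative) would be keeping enough
commits that old file slices survive long enough to be compared against current ones:

    import org.apache.hudi.config.HoodieCompactionConfig;
    import org.apache.hudi.config.HoodieWriteConfig;

    // Retain more commits so that older file slices survive cleaning and
    // can later be merged/compared with current slices to recover deletes.
    HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .withCompactionConfig(HoodieCompactionConfig.newBuilder()
            .retainCommits(48) // illustrative: keep 48 commits of history
            .build())
        .build();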

> Support a method for querying deleted data in the incremental view
> -------------------------------------------------------------------
>                 Key: HUDI-480
>                 URL:
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Incremental Pull
>            Reporter: cdmikechen
>            Priority: Minor
> As we know, Hudi already supports many ways to query data in Spark, Hive, and Presto.
It also provides a very useful timeline concept for tracing changes in data, which can be
used to query incremental data in the incremental view.
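> For reference, an incremental pull through the Spark datasource looks roughly like this
(a sketch; option names have shifted across Hudi versions, and `spark` / `basePath` are
assumed to already exist):
>
>     import org.apache.spark.sql.Dataset;
>     import org.apache.spark.sql.Row;
>
>     // Read only the records committed after the given instant (incremental view).
>     Dataset<Row> incremental = spark.read()
>         .format("org.apache.hudi")
>         .option("hoodie.datasource.query.type", "incremental")
>         .option("hoodie.datasource.read.begin.instanttime", "20200320000000")
>         .load(basePath);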
> Previously we only had insert and update functions to upsert data; now we have also
added new functions to delete existing data.
> *[HUDI-328] Adding delete api to HoodieWriteClient*
> *[HUDI-377] Adding Delete() support to DeltaStreamer*
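> For concreteness, using the new delete API looks roughly like this (a sketch against the
0.5.x client; `jsc` and `writeClient` are assumed to already exist, and package names may
differ between versions):
>
>     import org.apache.hudi.WriteStatus;
>     import org.apache.hudi.common.model.HoodieKey;
>     import org.apache.spark.api.java.JavaRDD;
>
>     // Delete by (recordKey, partitionPath). After this commit, nothing
>     // about the deleted keys is visible to incremental readers.
>     JavaRDD<HoodieKey> keysToDelete = jsc.parallelize(java.util.Arrays.asList(
>         new HoodieKey("key-000001", "2020/03/22")));
>     String instantTime = writeClient.startCommit();
>     JavaRDD<WriteStatus> statuses = writeClient.delete(keysToDelete, instantTime);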
> So, now that we have a delete API, should we also add a method to retrieve deleted data
in the incremental view?
> I've looked at the methods for generating new parquet files. The main idea is to combine
old and new data, and then filter out the data which needs to be deleted, so that the deleted
data does not exist in the new dataset. However, this also means the deleted data is not
retained in the new dataset, so during data tracing in the incremental view only the inserted
or modified data can be found via the existing timestamp field.
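> In other words, the merge step behaves conceptually like the sketch below (illustrative
pseudocode of the idea described above, not Hudi's actual merge handle; all names are made
up):
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     // Conceptual: the rewritten file keeps old plus upserted records and
>     // simply drops the deleted keys, leaving no trace of them behind.
>     Map<String, HoodieRecord> merged = new HashMap<>(oldRecordsByKey);
>     merged.putAll(upsertedRecordsByKey); // inserts and updates win
>     deletedKeys.forEach(merged::remove); // deletes just vanish here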
> If we want to support this, I feel there are two ideas to consider (see the sketches after
this list):
> 1. Trace the dataset in the same file at different time checkpoints along the timeline,
compare the two datasets by record key, and filter out the deleted data. This adds no extra
cost at write time, but the comparison has to run for each such query, which makes reads
expensive.
> 2. When writing data, record any deleted data in a separate file, with a name such as
*.delete_filename_version_timestamp*, so that we can answer such queries immediately for a
given time. The trade-off is the additional processing at write time.
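> A rough sketch of idea 1, where readAsOf() is a hypothetical helper that returns the
table contents as of a given instant (for example built on file slices the cleaner has
retained), and _hoodie_record_key is Hudi's metadata column:
>
>     // Keys present at the earlier instant but missing at the later one
>     // are exactly the records deleted in between.
>     Dataset<Row> before = readAsOf(spark, basePath, "20200322000000"); // hypothetical helper
>     Dataset<Row> after = readAsOf(spark, basePath, "20200323000000");  // hypothetical helper
>     Dataset<Row> deleted = before.select("_hoodie_record_key")
>         .except(after.select("_hoodie_record_key"));
>
> And a rough sketch of idea 2, which is purely illustrative since no such marker file
exists in Hudi today (`fs`, `partitionPath`, `fileName`, `version`, `instantTime`, and
`keysRemovedInThisCommit` are all assumed to be available at write time):
>
>     import org.apache.hadoop.fs.FSDataOutputStream;
>     import org.apache.hadoop.fs.Path;
>
>     // Persist the deleted keys next to the data file as
>     // ".delete_<filename>_<version>_<timestamp>", so that incremental
>     // readers can report deletes for a time range without diffing snapshots.
>     Path deleteMarker = new Path(partitionPath,
>         String.format(".delete_%s_%s_%s", fileName, version, instantTime));
>     try (FSDataOutputStream out = fs.create(deleteMarker)) {
>       for (HoodieKey key : keysRemovedInThisCommit) {
>         out.writeBytes(key.getRecordKey() + "\n");
>       }
>     }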
