hudi-commits mailing list archives

From "Raymond Xu (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HUDI-499) Allow partition path to be updated with GLOBAL_BLOOM index
Date Sat, 04 Jan 2020 17:46:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008106#comment-17008106
] 

Raymond Xu commented on HUDI-499:
---------------------------------

[~shivnarayan] [~vinoth] 

I would appreciate some feedback on the implementation of this option:
 # I suppose [this is the method|https://github.com/apache/incubator-hudi/blob/a733f4ef723865738d8541282c0c7234d64668db/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L107]
that needs to change so that it generates both to-be-deleted and to-be-inserted records. A flatMap would
be needed.
 # How do we generally do deletion? Is it equivalent to inserting an empty record?
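
To make point 1 concrete, here is a rough sketch of the flatMap idea in plain Java streams. It is not Hudi code: {{Rec}}, {{tag}}, the {{delete}} flag and the {{existingLocations}} map are hypothetical names made up for illustration, and the boolean delete marker is only an assumption, since how deletion is actually represented is exactly what point 2 asks.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PartitionPathUpdateSketch {

    // Hypothetical stand-in for an incoming record; not a Hudi class.
    static final class Rec {
        final String key;
        final String partitionPath;
        final boolean delete; // assumption: true marks a record to remove from partitionPath

        Rec(String key, String partitionPath, boolean delete) {
            this.key = key;
            this.partitionPath = partitionPath;
            this.delete = delete;
        }

        @Override
        public String toString() {
            return (delete ? "DELETE " : "UPSERT ") + key + " @ " + partitionPath;
        }
    }

    // existingLocations maps a record key to the partition path where the
    // (hypothetical) global index found it. Each incoming record expands to
    // one or two output records, hence the flatMap.
    static List<Rec> tag(List<Rec> incoming,
                         Map<String, String> existingLocations,
                         boolean updatePartitionPath) {
        return incoming.stream().flatMap(r -> {
            String oldPath = existingLocations.get(r.key);
            if (oldPath == null || oldPath.equals(r.partitionPath)) {
                // New key, or partition path unchanged: pass through as-is.
                return Stream.of(r);
            }
            if (!updatePartitionPath) {
                // Behaviour described in the issue: keep the original partition path.
                return Stream.of(new Rec(r.key, oldPath, false));
            }
            // Proposed behaviour: delete from the old path, insert into the new one.
            return Stream.of(new Rec(r.key, oldPath, true), r);
        }).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> existing = new HashMap<>();
        existing.put("person-1", "1999/12/31");

        List<Rec> incoming = Arrays.asList(
                new Rec("person-1", "2000/01/01", false),  // birthday corrected
                new Rec("person-2", "1985/06/15", false)); // new person

        tag(incoming, existing, true).forEach(System.out::println);
    }
}
```

With the flag off, the sketch falls back to the behaviour described in the issue (the record stays in its original partition path), so the option would be opt-in.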

> Allow partition path to be updated with GLOBAL_BLOOM index
> ----------------------------------------------------------
>
>                 Key: HUDI-499
>                 URL: https://issues.apache.org/jira/browse/HUDI-499
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Raymond Xu
>            Priority: Major
>
> h3. Context
> When a record is updated with a new partition path and the index is set to GLOBAL_BLOOM,
the current logic implemented in [https://github.com/apache/incubator-hudi/pull/1091/]
ignores the new partition path and updates the record in its original partition path.
> h3. Proposed change
> Allow such records to be inserted into their new partition paths and deleted from
the old partition paths. A configuration (e.g. {{hoodie.index.bloom.update.partitionpath=true}})
can be added to enable this feature.
> h4. An example use case
> A Hudi dataset manages people's info and is partitioned by birthday. In most cases where
people's info is updated, birthdays do not change (that's why we chose it as the partition
field). But in some edge cases a birthday is entered wrongly, and we want to fix it manually
or allow the user to update it occasionally. In this case, option 2 would be helpful in
keeping records in the expected partition, so that a query like "show me people who were born
after 2000" would work.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
