hudi-dev mailing list archives

From Pratyaksh Sharma <pratyaks...@gmail.com>
Subject Re: [Hudi Improvement]: Modification of partition path format to support simplified queries
Date Wed, 21 Aug 2019 11:47:32 GMT
Hi Vinoth/Balaji,

I was able to solve my use case by using TimestampBasedKeyGenerator as the
KeyGenerator. Thank you for suggesting the hook.
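
For readers following the thread, here is a minimal sketch of the idea, not
Hudi code: it reformats a created_at value in the yyyy-MM-dd HH:mm:ss.S
layout into a yyyy-MM-dd partition value, which is roughly what a
timestamp-based key generator is configured to do. The helper name
partition_path and the sample rows are hypothetical, and Python is used only
for brevity.

```python
from datetime import datetime

# Input layout from the thread: yyyy-MM-dd HH:mm:ss.S
# (%f accepts one to six fractional-second digits).
INPUT_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def partition_path(created_at: str, output_format: str = "%Y-%m-%d") -> str:
    """Hypothetical helper: turn a created_at string into a partition value."""
    return datetime.strptime(created_at, INPUT_FORMAT).strftime(output_format)


# Made-up sample values, for illustration only.
rows = [
    "2014-12-31 23:59:59.0",
    "2015-01-01 00:00:00.0",
    "2015-02-14 08:30:00.5",
    "2015-03-01 12:00:00.0",
    "2015-03-02 00:00:00.0",
]

paths = [partition_path(r) for r in rows]

# With yyyy-MM-dd values, string order matches date order, so a single
# range predicate (like BETWEEN in the query below) selects the right days.
in_range = [p for p in paths if "2015-01-01" <= p <= "2015-03-01"]
print(in_range)
```

Passing "%Y/%m/%d" as output_format gives the yyyy/MM/dd layout instead,
which is the case that needs separate year, month and day predicates.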

On Sat, Aug 17, 2019 at 2:23 PM Pratyaksh Sharma <pratyaksh13@gmail.com>
wrote:

> Hi Vinoth,
>
> I am travelling right now with limited internet access. I will check and
> update you on Monday.
>
> On Thu, Aug 15, 2019, 10:09 AM Vinoth Chandar <vinoth@apache.org> wrote:
>
>> Hi,
>>
>> Do these hooks seem sufficient to support what you are looking for?
>>
>> On Tue, Aug 13, 2019 at 8:16 PM vbalaji@apache.org <vbalaji@apache.org>
>> wrote:
>>
>> >
>> > Hi Pratyaksh,
>> > The partitioning format is pluggable in Hudi.
>> > 1. For Hudi writing, you can use one of the several implementations of
>> > org.apache.hudi.KeyGenerator, or write your own implementation, to
>> > control the partition path format. You can configure the key generator
>> > using
>> > https://hudi.incubator.apache.org/configurations.html#KEYGENERATOR_CLASS_OPT_KEY
>> > 2. For Hive syncing, there are again some default implementations of
>> > org.apache.hudi.hive.PartitionValueExtractor. You can also write your
>> > own partition value extractor and configure it using
>> > https://hudi.incubator.apache.org/configurations.html#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY
>> >
>> > Thanks,
>> > Balaji.V
>> >
>> > On Tuesday, August 13, 2019, 03:23:57 AM PDT, Pratyaksh Sharma
>> > <pratyaksh13@gmail.com> wrote:
>> >
>> >  Hi,
>> >
>> > I have been working on Hudi for some time and have an improvement
>> > suggestion.
>> >
>> > When we build a CDC pipeline, the field used for partitioning is
>> > generally a date (created_at), and the usual format of created_at is
>> > yyyy-MM-dd HH:mm:ss.S. If this field is formatted as yyyy/MM/dd, Hive
>> > queries that fetch data between two dates, which is the usual case,
>> > become much more complex. For example:
>> >
>> > 1. If the partitions are in the format yyyy/MM/dd, a query to select
>> > data for all days between 2015-01-01 and 2015-03-01 would look like:
>> >
>> > SELECT * FROM db.table where year=2015 and ((month=01 or month=02) or
>> > (month=03 and day=01))
>> >
>> > 2. If the partitions are instead in the format yyyy-MM-dd or yyyyMMdd,
>> > the data can be queried directly. For example, the query above would
>> > look like:
>> >
>> > SELECT * from db.table where DateStamp between '2015-01-01' and
>> > '2015-03-01'
>> >
>> >
>> > Reference -
>> > https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.html
>> >
>> > The proposal is to make yyyy-MM-dd the default partition format, or at
>> > least to provide a way to change the format.
>> >
>> > Please share your thoughts on the above. The JIRA raised for this is
>> > HUDI-206: https://issues.apache.org/jira/browse/HUDI-206
>> >
>> >
>> > Regards,
>> > Pratyaksh
>>
>
