hudi-commits mailing list archives

From "lamber-ken (Jira)" <j...@apache.org>
Subject [jira] [Resolved] (HUDI-353) Add support for Hive style partitioning path
Date Wed, 04 Mar 2020 07:10:00 GMT

     [ https://issues.apache.org/jira/browse/HUDI-353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lamber-ken resolved HUDI-353.
-----------------------------
    Resolution: Resolved

Fixed on master in commit e555aa516de867a4faf0322e79defa1f52d887ef

> Add support for Hive style partitioning path
> --------------------------------------------
>
>                 Key: HUDI-353
>                 URL: https://issues.apache.org/jira/browse/HUDI-353
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Hive Integration
>            Reporter: Wenning Ding
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In Hive, the partition folder name follows this format: <partition_column_name>=<partition_value>.
> But in Hudi, the partition folder name is just <partition_value>.
> For example, suppose a dataset is partitioned by three columns: year, month, and day.
> In Hive, the data is saved in: {{.../<table_name>/year=2019/month=05/day=01/xxx.parquet}}
> In Hudi, the data is saved in: {{.../<table_name>/2019/05/01/xxx.parquet}}
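The two layouts above can be illustrated with a minimal sketch. This is not Hudi's code: the function name `partition_path` and its `hive_style` flag are hypothetical, chosen only to show how the same partition values map to each folder naming scheme.

```python
from typing import Mapping


def partition_path(partition_values: Mapping[str, str], hive_style: bool = False) -> str:
    """Build a relative partition path from ordered column -> value pairs.

    Hypothetical helper for illustration only; it is not part of Hudi.
    """
    if hive_style:
        # Hive style: each folder is named <partition_column_name>=<partition_value>
        segments = [f"{column}={value}" for column, value in partition_values.items()]
    else:
        # Value-only style: each folder is named <partition_value>
        segments = list(partition_values.values())
    return "/".join(segments)


values = {"year": "2019", "month": "05", "day": "01"}
print(partition_path(values))                   # 2019/05/01
print(partition_path(values, hive_style=True))  # year=2019/month=05/day=01
```

The sketch relies on Python dictionaries preserving insertion order, so the folder order follows the order in which the partition columns are listed.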
> I added a new option to the Spark datasource, {{HIVE_STYLE_PARTITIONING_FILED_OPT_KEY}}, that indicates whether to use Hive-style partitioning. By default this option is false (disabled).
> Also, with Hive-style partitioning, instead of scanning the dataset and manually adding or updating every partition, we can run "MSCK REPAIR TABLE <table_name>" to automatically sync all the partition info with the Hive Metastore.
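As a rough illustration of why Hive-style folder names make automatic partition discovery possible, the sketch below recovers the column names and values from a relative path alone. It is not how Hive implements MSCK REPAIR TABLE; the function `parse_hive_partition` is a hypothetical helper that only shows that a `name=value` path carries the full partition spec, while a value-only path does not.

```python
from typing import Dict


def parse_hive_partition(relative_path: str) -> Dict[str, str]:
    """Recover {column: value} from a Hive-style relative partition path.

    Hypothetical helper for illustration only; it is not part of Hive or Hudi.
    Raises ValueError when a folder name is not of the form <name>=<value>.
    """
    spec: Dict[str, str] = {}
    for segment in relative_path.strip("/").split("/"):
        name, separator, value = segment.partition("=")
        if not separator or not name:
            # A value-only folder such as "2019" does not say which column it belongs to.
            raise ValueError(f"not a Hive-style partition folder: {segment!r}")
        spec[name] = value
    return spec


print(parse_hive_partition("year=2019/month=05/day=01"))
# {'year': '2019', 'month': '05', 'day': '01'}
```

Calling the same function on `2019/05/01` raises `ValueError`, which mirrors the point of the issue: without the column names in the path, the partitions have to be registered explicitly.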



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
