hudi-commits mailing list archives

From "Alexander Filipchik (Jira)" <j...@apache.org>
Subject [jira] [Updated] (HUDI-512) Decouple logical partitioning from physical one.
Date Wed, 08 Jan 2020 05:34:00 GMT

     [ https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Filipchik updated HUDI-512:
-------------------------------------
    Description: 
This one is more inspirational, but, I believe, will be very useful. Currently Hudi follows
the Hive table format, which means that data is logically and physically partitioned into a
folder structure like:

table_name
  2019
    01
    02
      bla.parquet

 

This has several issues:

 1) Modern object stores (AWS S3, GCP) are more performant when each file name starts with
some kind of random value. The Hive layout, by definition, does not do this, so it is not ideal
for them.

2) Hive Metastore stores partitions in a text field in a single table (2 tables with very
similar information) and doesn't support proper filtering. Data partitioned by day will be
stored like:

2019/01/10

2019/01/11

so only regexp queries are supported (at least in Hive 2.X.X)

3) Having a single point of failure which relies on a non-distributed DB is dangerous and
creates bottlenecks.
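
To make issue 1 concrete, here is a minimal sketch of what an object key with a
random-looking prefix could look like. It is only an illustration: `hashed_key`, its
parameters, and the choice of a SHA-256 prefix are assumptions made for this example, not
part of Hudi or of any object store API.

```python
import hashlib


def hashed_key(table: str, partition: str, file_name: str, prefix_len: int = 4) -> str:
    """Build an object key that starts with a short hash-derived prefix.

    Hypothetical helper: the names and the key layout are illustrative only.
    """
    # The logical (Hive-style) path, e.g. "table_name/2019/02/bla.parquet".
    logical_path = f"{table}/{partition}/{file_name}"
    # Take the first few hex characters of a hash of the path as the prefix,
    # so keys for neighbouring dates no longer share the same leading characters.
    digest = hashlib.sha256(logical_path.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{logical_path}"


if __name__ == "__main__":
    print(hashed_key("table_name", "2019/02", "bla.parquet"))
```

The prefix is derived from the logical path, so the same file always maps to the same key.
The trade-off is that listing one day's files by path prefix no longer works, so the reader
needs some other way to find the files for a given time range.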

 

The idea is to get rid of logical partitioning altogether (and the Hive Metastore as well).
If a dataset has a time column, the user should be able to query it without understanding the
physical layout of the table (today that means either specifying those partitions explicitly
or accidentally ending up with a full table scan).

It will require some kind of mapping of time to file locations (similar to Iceberg). I'm also
leaning towards the idea that storing table metadata with the table is a good thing as it
can be read by the engine in one shot and will be faster than taxing a standalone metastore. 
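
A rough sketch of such a time-to-file mapping follows. Everything in it is hypothetical:
`FileEntry`, `files_for_range`, and the example paths are made up for illustration and are
not the Hudi or Iceberg metadata format. The only idea it shows is that, if each data file
records the time range it covers, a query on the time column can be pruned to the relevant
files without the user knowing the physical layout.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class FileEntry:
    """One row of a hypothetical table-level manifest stored with the table."""
    path: str          # physical location; can be any layout, including hashed prefixes
    min_ts: datetime   # smallest value of the time column in the file
    max_ts: datetime   # largest value of the time column in the file


def files_for_range(manifest: List[FileEntry], start: datetime, end: datetime) -> List[str]:
    """Return paths of files whose [min_ts, max_ts] range overlaps [start, end]."""
    return [e.path for e in manifest if e.min_ts <= end and e.max_ts >= start]


# Example manifest; the paths and timestamps are invented for this sketch.
EXAMPLE_MANIFEST = [
    FileEntry("a3f1/bla-0.parquet", datetime(2019, 1, 10), datetime(2019, 1, 10, 23, 59, 59)),
    FileEntry("07c9/bla-1.parquet", datetime(2019, 1, 11), datetime(2019, 1, 11, 23, 59, 59)),
    FileEntry("e2b4/bla-2.parquet", datetime(2019, 2, 1), datetime(2019, 2, 1, 23, 59, 59)),
]

if __name__ == "__main__":
    # A filter on the time column only; no partition is named by the user.
    print(files_for_range(EXAMPLE_MANIFEST, datetime(2019, 1, 10, 12), datetime(2019, 1, 11, 6)))
```

Because the manifest lives with the table, an engine could read it once and plan the scan
from it. A linear scan over entries is used here for brevity; a real implementation would
presumably need an index or sorted layout for large tables.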

  was:
This one is more inspirational, but, I believe, will be very useful. Currently Hudi follows
the Hive table format, which means that data is logically and physically partitioned into a
folder structure like:

table_name
  2019
    01
    02
      bla.parquet

 

This has several issues:

 1) Modern object stores (AWS S3, GCP) are more performant when each file name starts with
some kind of random value. The Hive layout, by definition, does not do this, so it is not ideal
for them.

2) Hive Metastore stores partitions in a text field in a single table (2 tables with very
similar information) and doesn't support proper filtering. Data partitioned by day will be
stored like:

2019/01/10

2019/01/11

so only regexp queries are supported (at least in Hive 2.X.X)

3) Having a single point of failure which relies on a non-distributed DB is dangerous and
creates bottlenecks.

 

The idea is to get rid of logical partitioning altogether. If a dataset has a time column,
the user should be able to query it without understanding the physical layout of the table
(today that means either specifying those partitions explicitly or accidentally ending up
with a full table scan).

It will require some kind of mapping of time to file locations (similar to Iceberg). I'm also
leaning towards the idea that storing table metadata with the table is a good thing as it
can be read by the engine in one shot and will be faster than taxing a standalone metastore. 


> Decouple logical partitioning from physical one. 
> -------------------------------------------------
>
>                 Key: HUDI-512
>                 URL: https://issues.apache.org/jira/browse/HUDI-512
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Common Core
>            Reporter: Alexander Filipchik
>            Priority: Major
>              Labels: features
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
