hudi-commits mailing list archives

From "Yanjia Gary Li (Jira)" <j...@apache.org>
Subject [jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions
Date Mon, 02 Mar 2020 06:29:00 GMT

     [ https://issues.apache.org/jira/browse/HUDI-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanjia Gary Li updated HUDI-597:
--------------------------------
    Description: 
For use cases where I only need to pull the incremental changes of certain partitions, I currently
have to run the incremental pull against the entire dataset and then filter it down in Spark (see the sketch below).

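For reference, the current workaround looks roughly like the following. This is only a minimal sketch: the table path, the begin instant, and the "year" partition column are assumptions for illustration, not part of the proposal.

{code:java}
// Minimal sketch of today's workaround (assumes "year" is the partition column;
// basePath and beginTime are placeholders): pull incrementally from the whole
// table, then filter down to the partitions of interest in Spark afterwards.
import org.apache.hudi.DataSourceReadOptions
import org.apache.spark.sql.functions.col

val basePath  = "/path/to/hudi/table" // placeholder table location
val beginTime = "000"                 // pull everything since the first commit

val incremental = spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, beginTime)
  .load(basePath) // lists and loads files across all partitions

// Partition pruning only happens here, after the full incremental scan.
val partitionSubset = incremental.filter(col("year") === "2016")
{code}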
If we can use the folder partitions directly as part of the input path, the pull could run faster
by loading only the relevant parquet files.

Example:

{code:java}
spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
  .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "/year=2016/*/*/*")
  .load(path)
{code}

  was:
For use cases where I only need to pull the incremental changes of certain partitions, I currently
have to run the incremental pull against the entire dataset and then filter it down in Spark.

If we can use the folder partitions directly as part of the input path, the pull could run faster
by loading only the relevant parquet files.

Example:

{code:java}
spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
  .load(path, "year=2020/*/*/*")
{code}


> Enable incremental pulling from defined partitions
> --------------------------------------------------
>
>                 Key: HUDI-597
>                 URL: https://issues.apache.org/jira/browse/HUDI-597
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>            Reporter: Yanjia Gary Li
>            Assignee: Yanjia Gary Li
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.5.2
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> For use cases where I only need to pull the incremental changes of certain partitions,
I currently have to run the incremental pull against the entire dataset and then filter it down in Spark.
> If we can use the folder partitions directly as part of the input path, the pull could run
faster by loading only the relevant parquet files.
> Example:
>
> {code:java}
> spark.read.format("org.apache.hudi")
>   .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
>   .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
>   .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "/year=2016/*/*/*")
>   .load(path)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
