hudi-commits mailing list archives

From "Vinoth Chandar (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HUDI-485) Check for where clause is wrong in HiveIncrementalPuller
Date Thu, 02 Jan 2020 18:04:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006991#comment-17006991 ]

Vinoth Chandar commented on HUDI-485:
-------------------------------------

Here is a brain dump and you can then take over :) 

 

Let's say the incremental query asks for all records with `_hoodie_commit_time > t1`.

In a nutshell, what we actually have in the commit metadata (the .commit and .deltacommit
files) is the file slice (a base parquet file written at an instant time, plus a set of log
files generated as deltas on top of that base). The parquet file and the logs can actually contain
records that were written before time t1, so the incremental query filters at two levels:

- First, it gets all the latest file slices that were written to after time t1.

- Next, within these file slices, it keeps only the records whose `_hoodie_commit_time` > t1 (see the sketch below).
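
To make the two levels concrete, here is a minimal sketch. It assumes simplified, hypothetical stand-ins for FileSlice and HoodieRecord that only expose the instant time and the record-level meta column, and it treats commit instants as lexicographically comparable timestamp strings; the real logic lives in Hudi's timeline and InputFormat code.

import java.util.List;
import java.util.stream.Collectors;

// Simplified stand-ins for this sketch; the real classes live in hudi-common.
class FileSlice {
  String commitTime;          // instant time of the latest write to this slice
  List<HoodieRecord> records; // records readable from the base + log files
}

class HoodieRecord {
  String hoodieCommitTime;    // value of the _hoodie_commit_time meta column
}

public class TwoLevelIncrementalFilterSketch {

  // Level 1: keep only the latest file slices that were written to after t1.
  static List<FileSlice> slicesWrittenAfter(List<FileSlice> latestSlices, String t1) {
    return latestSlices.stream()
        .filter(slice -> slice.commitTime.compareTo(t1) > 0)
        .collect(Collectors.toList());
  }

  // Level 2: within those slices, keep only records with _hoodie_commit_time > t1.
  static List<HoodieRecord> recordsWrittenAfter(List<FileSlice> candidateSlices, String t1) {
    return candidateSlices.stream()
        .flatMap(slice -> slice.records.stream())
        .filter(record -> record.hoodieCommitTime.compareTo(t1) > 0)
        .collect(Collectors.toList());
  }
}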

 

(P.S.: This sort of record-level metadata is what differentiates Hudi, as a true streaming system,
from others.)

I will take copy-on-write and explain this, since it's easier, but it generalizes to MOR as
well. For copy-on-write, the commit metadata points to all the parquet files that were written
(either new, or a new version of an existing file) at that commit. So, by reading all the .commit
files after a given time t1, we can know all the parquet files with records written after
time t1 (a superset). But this set of files will also have older records, and thus we needed
to push a filter at the InputFormat level (see IncrementalRelation.scala in hudi-spark for the logic
that automatically does this in Spark) to only return the rows that match the `_hoodie_commit_time`
> t1 criterion. Pushing this down to parquet is the most efficient way.
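
As a rough illustration of pushing that filter down to parquet, here is a sketch that uses parquet-mr's generic FilterApi/ParquetInputFormat hooks directly; this is not Hudi's actual code path, and the commit-time string t1 is just a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.io.api.Binary;

public class CommitTimePushdownSketch {

  // Builds a predicate equivalent to: _hoodie_commit_time > t1
  static FilterPredicate commitTimeAfter(String t1) {
    return FilterApi.gt(
        FilterApi.binaryColumn("_hoodie_commit_time"),
        Binary.fromString(t1));
  }

  // Registers the predicate so parquet can skip row groups/pages whose
  // column statistics show they cannot contain commit times greater than t1.
  static void configure(Configuration conf, String t1) {
    ParquetInputFormat.setFilterPredicate(conf, commitTimeAfter(t1));
  }
}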

 

When we tried to do this before [https://github.com/apache/incubator-hudi/blob/hoodie-0.3.0/hoodie-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/HoodieInputFormat.java#L192],
the predicate did not actually work.

 


> Check for where clause is wrong in HiveIncrementalPuller
> --------------------------------------------------------
>
>                 Key: HUDI-485
>                 URL: https://issues.apache.org/jira/browse/HUDI-485
>             Project: Apache Hudi (incubating)
>          Issue Type: Sub-task
>          Components: Incremental Pull, newbie
>            Reporter: Pratyaksh Sharma
>            Assignee: Pratyaksh Sharma
>            Priority: Major
>
> HiveIncrementalPuller checks the clause in incrementalSqlFile like this:
>
> if (!incrementalSQL.contains("`_hoodie_commit_time` > '%targetBasePath'")) {
>   LOG.info("Incremental SQL : " + incrementalSQL + " does not contain `_hoodie_commit_time` > %targetBasePath. Please add "
>       + "this clause for incremental to work properly.");
>   throw new HoodieIncrementalPullSQLException("Incremental SQL does not have clause `_hoodie_commit_time` > '%targetBasePath', which "
>       + "means its not pulling incrementally");
> }
>
> Basically we are trying to add a placeholder here, which is later replaced with config.fromCommitTime here:
>
> incrementalPullSQLtemplate.add("incrementalSQL", String.format(incrementalSQL, config.fromCommitTime));
>
> Hence, the above check needs to be replaced with `_hoodie_commit_time` > %s
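
For illustration, a minimal sketch of the corrected guard suggested above, assuming the %s placeholder style from the quoted snippet (logging and exception handling elided; this follows the issue's suggestion, not a committed patch):

public class IncrementalSqlCheckSketch {
  // Sketch only: validate against the %s placeholder that String.format later
  // fills with config.fromCommitTime, instead of the literal '%targetBasePath'.
  static boolean hasIncrementalClause(String incrementalSQL) {
    return incrementalSQL.contains("`_hoodie_commit_time` > %s");
  }
}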



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
