hudi-commits mailing list archives

From "Suneel Marthi (Jira)" <j...@apache.org>
Subject [jira] [Updated] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time
Date Sun, 02 Feb 2020 23:42:02 GMT

     [ https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi updated HUDI-415:
-------------------------------
    Fix Version/s:     (was: 0.5.1)
                   0.5.2

> HoodieSparkSqlWriter Commit time not representing the Spark job starting time
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-415
>                 URL: https://issues.apache.org/jira/browse/HUDI-415
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>            Reporter: Yanjia Gary Li
>            Assignee: Yanjia Gary Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.2
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hudi records the commit time only after the first Spark action completes. If a heavy transformation runs before isEmpty(), the recorded commit time can be much later than the time the job actually started.
> {code:java}
> if (hoodieRecords.isEmpty()) {
>   log.info("new batch has no new records, skipping...")
>   return (true, common.util.Option.empty())
> }
> commitTime = client.startCommit()
> writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords, commitTime, operation)
> {code}
> For example, if I start the Spark job at 201901010000 but *isEmpty()* runs for 2 hours, the commit time in the .hoodie folder will be 201901010*2*00. If I then use that commit time to ingest data starting from 201901010200 (from HDFS, not using DeltaStreamer), I will miss 2 hours of data.
> Is this setup intended? Can we obtain the commit time before calling isEmpty()?
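
The ordering problem described above can be reproduced without Spark or Hudi. The following is only a minimal sketch: `CommitTimeSketch`, `ManualClock`, and both methods are hypothetical names, the 12-digit `yyyyMMddHHmm` layout is assumed from the example values in the report, and advancing a hand-driven clock stands in for the slow isEmpty() call. It is not the actual HoodieSparkSqlWriter code or the fix that was merged.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class CommitTimeSketch {
    // Assumption: timestamp layout taken from the 12-digit values in the report.
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMddHHmm");

    /** Hypothetical clock that only moves when told to, so the result is deterministic. */
    static final class ManualClock extends Clock {
        private Instant current;
        ManualClock(Instant start) { this.current = start; }
        void advance(Duration d) { current = current.plus(d); }
        @Override public ZoneId getZone() { return ZoneOffset.UTC; }
        @Override public Clock withZone(ZoneId zone) { return this; }
        @Override public Instant instant() { return current; }
    }

    static String now(Clock clock) {
        return LocalDateTime.now(clock).format(FMT);
    }

    /** Order reported in the issue: the expensive check runs first, then the commit time is taken. */
    static String commitTimeAfterCheck(Instant jobStart, Duration checkCost) {
        ManualClock clock = new ManualClock(jobStart);
        clock.advance(checkCost); // stand-in for a long-running isEmpty()
        return now(clock);        // stand-in for client.startCommit()
    }

    /** Order proposed in the issue: take the commit time first, then run the expensive check. */
    static String commitTimeBeforeCheck(Instant jobStart, Duration checkCost) {
        ManualClock clock = new ManualClock(jobStart);
        String commitTime = now(clock); // captured when the job starts
        clock.advance(checkCost);       // the check no longer shifts the commit time
        return commitTime;
    }

    public static void main(String[] args) {
        Instant jobStart = Instant.parse("2019-01-01T00:00:00Z");
        Duration twoHours = Duration.ofHours(2);
        System.out.println(commitTimeAfterCheck(jobStart, twoHours));  // 201901010200
        System.out.println(commitTimeBeforeCheck(jobStart, twoHours)); // 201901010000
    }
}
```

With the second ordering, a downstream reader that resumes from the commit time would start at the job start time, so the 2-hour gap from the example does not appear. One trade-off to keep in mind is that a commit time would then be generated even for batches that turn out to be empty.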



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
