hudi-commits mailing list archives

From "Bhavani Sudha (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results
Date Tue, 03 Mar 2020 19:13:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050488#comment-17050488 ]

Bhavani Sudha commented on HUDI-651:
------------------------------------

:) Apologies for the ambiguity. I should have used the appropriate terms. Let me try one more time.

Assume there is one file group that has only one base file and one or more log files. In this
case, the result of your incremental query would be empty. As you understood, the base file
gets filtered out on commit time. If there are more base files, then depending on the commit
time filter used, the result can be non-empty.
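To make the above concrete, here is a toy model of that filtering behavior (hypothetical names like {{BaseFile}} and {{incrementalBaseFiles}} — this is a sketch of the behavior I described, not Hudi's actual code): only base files with a commit time newer than the start timestamp survive the filter, and log file commits are not considered on their own.

```scala
// Hypothetical model of incremental filtering on base files only.
case class BaseFile(commitTime: String)
case class FileGroup(baseFiles: List[BaseFile], logFileCommits: List[String])

// Keep only base files whose commit time is strictly after the start
// timestamp; log files in the group are not matched independently.
def incrementalBaseFiles(group: FileGroup, start: String): List[BaseFile] =
  group.baseFiles.filter(_.commitTime > start)

// One file group: a single base file written at the first delta commit,
// plus a log file from the second. Consuming incrementally from the first
// commit time filters out the lone base file, so the result is empty.
val group = FileGroup(List(BaseFile("20200302210010")), List("20200302210147"))
println(incrementalBaseFiles(group, "20200302210010")) // empty list

// With a second base file at a later commit time, the same filter
// returns a non-empty result.
val group2 = FileGroup(
  List(BaseFile("20200302210010"), BaseFile("20200302210147")),
  Nil)
println(incrementalBaseFiles(group2, "20200302210010")) // one base file
```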

> Incremental Query on Hive via Spark SQL does not return expected results
> ------------------------------------------------------------------------
>
>                 Key: HUDI-651
>                 URL: https://issues.apache.org/jira/browse/HUDI-651
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: Vinoth Chandar
>            Assignee: Bhavani Sudha
>            Priority: Major
>             Fix For: 0.6.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was hoping to incrementally
consume them like Hive QL. Something is amiss.
> {code}
> scala> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from stock_ticks_mor_rt").show(100, false)
> +-------------------+
> |_hoodie_commit_time|
> +-------------------+
> |20200302210010     |
> |20200302210147     |
> +-------------------+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, file:/etc/hadoop/hive-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants [[20200302210010__clean__COMPLETED], [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, file:/etc/hadoop/hive-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants [[20200302210010__clean__COMPLETED], [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: Building file system view for partition (2018/08/31)
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: #files found in partition (2018/08/31) =3, Time taken =1
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=3, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: Time to load partition (2018/08/31) =2
> 20/03/02 21:15:37 INFO realtime.HoodieParquetRealtimeInputFormat: Returning a total splits of 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
