Date: Tue, 3 Mar 2020 08:36:00 +0000 (UTC)
From: "Bhavani Sudha (Jira)"
To: commits@hudi.apache.org
Subject: [jira] [Commented] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

    [ https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050002#comment-17050002 ]

Bhavani Sudha commented on HUDI-651:
------------------------------------

[~vinoth] I debugged this further and found the following:

# In your setup, incremental querying is not triggered because the property names should be keyed by the table name, i.e. hoodie.stock_ticks_mor.consume.mode instead of hoodie.stock_ticks_mor_rt.consume.mode. The reason it has to be "mor" instead of "mor_rt" comes from this PR: https://github.com/apache/incubator-hudi/pull/689. There we identify incremental queries based on the 'hoodie.table.name' property in the hoodie.properties file, and for the mor_rt table this property still has 'mor' as its value (see the first sketch after this list).
# Even after I made those changes, I didn't get any results. On debugging further, I noticed that HoodieParquetRealtimeInputFormat gets its splits from HoodieParquetInputFormat. In the example above, where this table has 1 parquet file and two log files, HoodieParquetInputFormat.getSplits() returns zero base files after applying the incremental filters, and HoodieParquetRealtimeInputFormat expects a base file in order to stitch the log files on top of it, so we see no result. I changed the condition such that the begin time is '0' instead of '20200302210147' in your example, and then it worked (see the second sketch after this list). I suppose we can also verify this by adding more file groups in this partition after 20200302210147 and checking that your query works fine.
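For reference, here is a minimal spark-shell sketch of the corrected configuration from point 1, using the same docker-demo table as in the repro below. The only change from the original commands is that the consume properties are keyed by the base table name (the 'hoodie.table.name' value, stock_ticks_mor), while the query still targets the _rt view:

{code}
scala> // Key the incremental-consume properties by the base table name ("mor"),
scala> // not by the "_rt" view name, so the input format recognizes them.
scala> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor.consume.start.timestamp", "20200302210147")
scala> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor.consume.mode", "INCREMENTAL")

scala> // The query itself still reads the realtime (_rt) view of the table.
scala> spark.sql("select distinct `_hoodie_commit_time` from stock_ticks_mor_rt").show(100, false)
{code}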
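To illustrate point 2, a self-contained sketch of the filtering behavior; the variable names are hypothetical and this is not the actual HoodieParquetInputFormat code. With the begin timestamp set to the latest commit, the strictly-after filter drops the only base file, leaving HoodieParquetRealtimeInputFormat nothing to stitch the log files onto:

{code}
// Hypothetical illustration only -- not Hudi source code.
val beginTs = "20200302210147"              // hoodie.<table>.consume.start.timestamp
val baseFileCommits = Seq("20200302210010") // the single parquet base file

// Incremental filtering keeps base files committed strictly after beginTs,
// so here it returns an empty list and no splits are produced.
val survivingBaseFiles = baseFileCommits.filter(_ > beginTs)    // Seq()

// With beginTs = "0" the base file survives, the realtime input format can
// stitch the two log files on top of it, and the query returns results.
val withZeroBegin = baseFileCommits.filter(_ > "0")             // Seq("20200302210010")
{code}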

> Incremental Query on Hive via Spark SQL does not return expected results
> ------------------------------------------------------------------------
>
>                 Key: HUDI-651
>                 URL: https://issues.apache.org/jira/browse/HUDI-651
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: Vinoth Chandar
>            Assignee: Bhavani Sudha
>            Priority: Major
>             Fix For: 0.6.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was hoping to incrementally consume them as in Hive QL. Something is amiss:
> {code}
> scala> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from stock_ticks_mor_rt").show(100, false)
> +-------------------+
> |_hoodie_commit_time|
> +-------------------+
> |20200302210010     |
> |20200302210147     |
> +-------------------+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, file:/etc/hadoop/hive-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants [[20200302210010__clean__COMPLETED], [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, file:/etc/hadoop/hive-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants [[20200302210010__clean__COMPLETED], [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: Building file system view for partition (2018/08/31)
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: #files found in partition (2018/08/31) =3, Time taken =1
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=3, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: Time to load partition (2018/08/31) =2
> 20/03/02 21:15:37 INFO realtime.HoodieParquetRealtimeInputFormat: Returning a total splits of 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)