Date: Tue, 3 Mar 2020 08:36:00 +0000 (UTC)
From: "Bhavani Sudha (Jira)"
To: commits@hudi.apache.org
Subject: [jira] [Commented] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

    [ https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050002#comment-17050002 ]

Bhavani Sudha commented on HUDI-651:
------------------------------------

[~vinoth] I debugged this further and found the following:

# In your setup, incremental querying is not triggered because the property names should be keyed by the table name, i.e. hoodie.stock_ticks_mor.consume.mode instead of hoodie.stock_ticks_mor_rt.consume.mode. The reason it has to be "mor" instead of "mor_rt" comes from this PR: https://github.com/apache/incubator-hudi/pull/689. There we identify incremental queries based on the 'hoodie.table.name' property in the hoodie.properties file, and for the mor_rt table this property still has 'mor' as its value (see the first sketch after this list).
# Even after I made those changes, I didn't get any results. On debugging further, I noticed that HoodieParquetRealtimeInputFormat gets its splits from HoodieParquetInputFormat. In the example above, where this table has 1 parquet file and two log files, HoodieParquetInputFormat.getSplits() returns zero base files after applying the incremental filters, and HoodieParquetRealtimeInputFormat expects a base file in order to stitch the log files on top of it, so we see no result. I changed the condition such that the begin time is '0' instead of '20200302210147' in your example, and then it worked (see the second sketch after this list). I suppose we can also verify this by adding more file groups in this partition after 20200302210147 and checking that your query works fine.
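For reference, here is a minimal spark-shell sketch of the corrected configuration from point 1, using the same docker-demo table as in the repro below. The only change from the original commands is that the consume properties are keyed by the base table name (the 'hoodie.table.name' value, stock_ticks_mor), while the query still targets the _rt view:

{code}
scala> // Key the incremental-consume properties by the base table name ("mor"),
scala> // not by the "_rt" view name, so the input format recognizes them.
scala> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor.consume.start.timestamp", "20200302210147")
scala> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor.consume.mode", "INCREMENTAL")

scala> // The query itself still reads the realtime (_rt) view of the table.
scala> spark.sql("select distinct `_hoodie_commit_time` from stock_ticks_mor_rt").show(100, false)
{code}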
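To illustrate point 2, a self-contained sketch of the filtering behavior; the variable names are hypothetical and this is not the actual HoodieParquetInputFormat code. With the begin timestamp set to the latest commit, the strictly-after filter drops the only base file, leaving HoodieParquetRealtimeInputFormat nothing to stitch the log files onto:

{code}
// Hypothetical illustration only -- not Hudi source code.
val beginTs = "20200302210147"              // hoodie.<table>.consume.start.timestamp
val baseFileCommits = Seq("20200302210010") // the single parquet base file

// Incremental filtering keeps base files committed strictly after beginTs,
// so here it returns an empty list and no splits are produced.
val survivingBaseFiles = baseFileCommits.filter(_ > beginTs)    // Seq()

// With beginTs = "0" the base file survives, the realtime input format can
// stitch the two log files on top of it, and the query returns results.
val withZeroBegin = baseFileCommits.filter(_ > "0")             // Seq("20200302210010")
{code}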

> Incremental Query on Hive via Spark SQL does not return expected results
> ------------------------------------------------------------------------
>
>                 Key: HUDI-651
>                 URL: https://issues.apache.org/jira/browse/HUDI-651
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: Vinoth Chandar
>            Assignee: Bhavani Sudha
>            Priority: Major
>             Fix For: 0.6.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was hoping to incrementally consume them as in Hive QL. Something is amiss:
> {code}
> scala> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from stock_ticks_mor_rt").show(100, false)
> +-------------------+
> |_hoodie_commit_time|
> +-------------------+
> |20200302210010     |
> |20200302210147     |
> +-------------------+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, file:/etc/hadoop/hive-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants [[20200302210010__clean__COMPLETED], [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, file:/etc/hadoop/hive-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants [[20200302210010__clean__COMPLETED], [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: Building file system view for partition (2018/08/31)
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: #files found in partition (2018/08/31) =3, Time taken =1
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=3, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: Time to load partition (2018/08/31) =2
> 20/03/02 21:15:37 INFO realtime.HoodieParquetRealtimeInputFormat: Returning a total splits of 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)