hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Barna Zsombor Klara <zsombor.kl...@cloudera.com>
Subject Re: Review Request 58501: HIVE-16469: Parquet timestamp table property is not always taken into account
Date Wed, 03 May 2017 12:59:42 GMT


> On May 2, 2017, 5:27 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapRedTask.java
> > Line 72 (original), 72 (patched)
> > <https://reviews.apache.org/r/58501/diff/3/?file=1695509#file1695509line72>
> >
> >     How does this work? I don't understand this change.

The user.timezone system property is used to set the default timezone of the JVM. If this
is set on the HS2 instance then we need to propagate it to the child VM spawned by a local
task or timestamps read by the local task will be incorrect.


> On May 2, 2017, 5:27 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java
> > Line 181 (original), 181 (patched)
> > <https://reviews.apache.org/r/58501/diff/3/?file=1695511#file1695511line181>
> >
> >     Is this compatible with old parquet tables? if the property is not set, then
the validateTimeZone    might fail, right? If so, do we want to fail reading tables that do
not have a property set?
> >     
> >     Something else to consider, if a user sets a timezone improperly in a different
tool or something      happened that we got an invalid timezone, then do we want to fail when
reading those files? Just      wondering this scenario, no need to fix it right away.

At this point the timezone property had to be set by ParquetTableUtils#setParquetTimeZoneIfAbsent
either from the table properties or using the default value TimeZone#getDefault. The core
problem is that I found it very difficult to make sure that <every> execution path will
check the table property.
- The FetchOperator works when we have a local task, but the MapRedParquetInputFormat does
not (MapWork is null). 
- The FetchOperator will not work with a complex query or an order by clause, but the InputFormat
should work in this case. 
- For statistics gathering only the StatNoJobTask is executed.
I wanted to make sure that if we have an execution path I forgot about, then we should rather
fail than to read incorrect timestamp values silently.
Similarly in my opinion if the timezone value is incorrect (because it was set by another
tool) then we should fail instead of reading illadjusted values.


> On May 2, 2017, 5:27 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java
> > Lines 35 (patched)
> > <https://reviews.apache.org/r/58501/diff/3/?file=1695512#file1695512line35>
> >
> >     Why is Map<?, ?> used instead of Map<String, String>? Aren't all
table properties key, value string pairs?
> >     
> >     Also, the ensureTablePropertySet() name seems not related to what we want to
do. I thought it was going to throw an exception if the property was not set, but it is setting
the value on the JobConf. Should we use a different name, such as setParquetTimeZoneIfNotSet(),
     setParquetTimeZoneIfAbsent() or something like that helps us understand quickly without
looking at the javadoc.

We are calling this method with Properties objects (i.e. from the FetchOperator) and using
Map<String, String> objects (i.e. from the StatsNoJobTask) and the common ancestor for
these two is the Map<?,?>. While it is true that the table properties can only be Strings
so the Properties should only contain String pairs I wanted to avoid the explicit cast.


- Barna Zsombor


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58501/#review173610
-----------------------------------------------------------


On May 3, 2017, 12:59 p.m., Barna Zsombor Klara wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58501/
> -----------------------------------------------------------
> 
> (Updated May 3, 2017, 12:59 p.m.)
> 
> 
> Review request for hive, Sergio Pena and Zoltan Ivanfi.
> 
> 
> Bugs: HIVE-16469
>     https://issues.apache.org/jira/browse/HIVE-16469
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16469: Parquet timestamp table property is not always taken into account
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 757b7fc0eaa39c956014aa446ab1b07fc4abf8d3

>   ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java 13750cdc34711d22f2adf2f483a6773ad05fb8d2

>   ql/src/java/org/apache/hadoop/hive/ql/exec/StatsNoJobTask.java 9c3a664b9aea2d6e050ffe2d7626127827dbc52a

>   ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapRedTask.java 1bd4db7805689ae1f91921ffbb5ff7da59f4bf60

>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java f4fadbb61bf45f62945700284c0b050f0984b696

>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java 2954601ce5bb25905cdb29ca0ca4551c2ca12b95

>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 6413c5add6db2e8c9298285b15dba33ee74379a8

>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java b339cc4347eea143dca2f6d98f9aa7777afdc427

>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java dbd6fb3d0bc8c753abf86e99b52377617f248b5a

>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/AbstractTestParquetDirect.java c81499a91c84af3ba33f335506c1c44e7085f13d

>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetRowGroupFilter.java bf363f32a3ac0a4d790e2925d802c6e210adfb4b

>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java
f2d79cf9d215e9a6e2a5e88cfc78378be860fd1f 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestNanoTimeUtils.java 1e10dbf18742524982606f1e6c6d447d683b2dc3

>   ql/src/test/queries/clientnegative/parquet_int96_alter_invalid_timezone.q PRE-CREATION

>   ql/src/test/queries/clientnegative/parquet_int96_create_invalid_timezone.q PRE-CREATION

>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q 6eadd1b0a3313cbba7a798890b802baae302749e

>   ql/src/test/results/clientnegative/parquet_int96_alter_invalid_timezone.q.out PRE-CREATION

>   ql/src/test/results/clientnegative/parquet_int96_create_invalid_timezone.q.out PRE-CREATION

>   ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out b9a3664458a83f1856e4bc59eba5d56665df61cc

>   ql/src/test/results/clientpositive/spark/parquet_int96_timestamp.q.out PRE-CREATION

> 
> 
> Diff: https://reviews.apache.org/r/58501/diff/4/
> 
> 
> Testing
> -------
> 
> Added qtests for the following cases:
> - order by clause
> - selfjoin
> - calling UDFs with the timestamp values
> - where clause with a constant cast as timestamp
> - test for HoS
> - implicit and explicit timestamp conversions in insert clause
> 
> Tested manually but no qtests:
> - join between 3 tables all parquet but with different/no timezone property
> - subselect in from/where clauses
> - exists / union / no exists
> 
> 
> Thanks,
> 
> Barna Zsombor Klara
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message