hive-dev mailing list archives

From Ryan Blue <b...@apache.org>
Subject Re: Review Request 41821: HIVE-12767: Implement table property to address Parquet int96 timestamp bug
Date Fri, 22 Jan 2016 23:15:32 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41821/#review115731
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java (line 139)
<https://reviews.apache.org/r/41821/#comment176776>

    One of the test cases listed in the scope doc is when the zone ID is unknown, e.g. an ID like "Antarctica/South_Pole" that isn't valid. In that case the table should not be readable, because the correct zone offset can't be determined.
    
    `TimeZone.getTimeZone(String)` returns GMT when the zone ID isn't recognized, so I'm concerned that this will do the wrong thing and silently use GMT in cases where the zone should be rejected.
    
    See https://docs.oracle.com/javase/7/docs/api/java/util/TimeZone.html#getTimeZone(java.lang.String)
    You might need to make a set of available IDs using https://docs.oracle.com/javase/7/docs/api/java/util/TimeZone.html#getAvailableIDs()
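    A validation sketch along those lines (the class and method names here are hypothetical, not from the patch):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TimeZone;

public class ZoneIdCheck {
  // IDs the JVM actually recognizes; TimeZone.getTimeZone(String) falls back
  // to GMT for anything else instead of failing.
  private static final Set<String> VALID_IDS =
      new HashSet<>(Arrays.asList(TimeZone.getAvailableIDs()));

  // Fail fast on unknown zone IDs rather than silently using GMT.
  public static TimeZone getValidatedTimeZone(String id) {
    if (!VALID_IDS.contains(id)) {
      throw new IllegalArgumentException("Unknown time zone id: " + id);
    }
    return TimeZone.getTimeZone(id);
  }
}
```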



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java (line 168)
<https://reviews.apache.org/r/41821/#comment176778>

    It looks like this removes support for `HiveConf.ConfVars.HIVE_PARQUET_TIMESTAMP_SKIP_CONVERSION.varname`. I think we should maintain support for that configuration setting since customers may already be using it and we don't want to change behavior. If it is set to true, it should cause the calendar to use UTC (so no adjustment is made). However, the new table property should override the Hive setting, so I think the check belongs in the `Strings.isNullOrEmpty` case.
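    To illustrate the precedence I have in mind (a sketch only; the class, method, and parameters are hypothetical, not Hive API):

```java
public class WriteZoneResolver {
  // Table property wins; otherwise the legacy skip-conversion flag selects
  // UTC (no adjustment); otherwise fall back to the local zone.
  public static String resolveWriteZone(String tableProperty,
                                        boolean skipConversion,
                                        String localZoneId) {
    if (tableProperty != null && !tableProperty.isEmpty()) {
      return tableProperty;
    }
    if (skipConversion) {
      return "UTC";
    }
    return localZoneId;
  }
}
```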



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java (line 57)
<https://reviews.apache.org/r/41821/#comment176779>

    Nit: We use UTC elsewhere. I believe the two are identical for our purposes, but to reduce confusion I'd recommend changing GMT to UTC.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java (line 211)
<https://reviews.apache.org/r/41821/#comment176783>

    The file metadata is also passed in [via InitContext](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L173).
If the file has the write zone property set, then it should be validated against the configured
write zone (if set). When present, the file's value should override the table value and there
should be a warning if the table value doesn't match the file value. That covers cases where
files are moved from one table to another.
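    Roughly this kind of check (a sketch; the names are mine, and the real warning would go through the logger rather than stderr):

```java
public class ZoneConsistencyCheck {
  // The file-level zone, when present, overrides the table property; warn on
  // a mismatch since it likely means files were moved between tables.
  public static String chooseZone(String fileZone, String tableZone) {
    if (fileZone == null) {
      return tableZone;
    }
    if (tableZone != null && !tableZone.equals(fileZone)) {
      System.err.println("WARN: table write zone " + tableZone
          + " does not match file write zone " + fileZone);
    }
    return fileZone;
  }
}
```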



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java (line 161)
<https://reviews.apache.org/r/41821/#comment176785>

    Shouldn't this use UTC, which will apply an offset of 0?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java (line 165)
<https://reviews.apache.org/r/41821/#comment176787>

    When will the zone property be set before this method? Is this how the table property
is passed in?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java (line 21)
<https://reviews.apache.org/r/41821/#comment176788>

    I think calling this the default write zone is not quite correct. It is the default zone
when the Hive config property is set, but the current zone is the default otherwise. What
about calling this the PARQUET_INT96_NO_ADJUSTMENT_ZONE?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java (line 53)
<https://reviews.apache.org/r/41821/#comment176792>

    This is a little odd. Normally, I would consider GMT/UTC to be the calendar that applies
no adjustment. But what is happening in this code is that the UTC millisecond value is being
constructed using values in that zone and then a new timestamp is created from it that is
in the local zone. So local->local produces no adjustment, while GMT->local does.
    
    I was thinking that GMT would always be used to create the milliseconds, then the calendar
would be used to adjust the value by some amount. UTC by 0, EST by -5, and PST by -8.
    
    I think it may be easier to have a method that does a single conversion to UTC in milliseconds.
Then, adjust the value using a static offset like the one from `TimeZone.getOffset(utc_ms)`.
That would result in a bit less work done here and I think would make the adjustment easier
to reason about.
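    In other words, something like this (a sketch; the helper name is mine):

```java
import java.util.TimeZone;

public class TimestampAdjust {
  // Treat the stored value as UTC milliseconds, then shift it by the zone's
  // offset at that instant. UTC shifts by 0, so UTC is the "no adjustment"
  // calendar, which I find easier to reason about.
  public static long adjustToZone(long utcMillis, TimeZone zone) {
    return utcMillis + zone.getOffset(utcMillis);
  }
}
```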



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java (line 35)
<https://reviews.apache.org/r/41821/#comment176981>

    What is the purpose of this property? Is this just a way to pass the time zone? Why not
use the property used elsewhere?



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java (line 227)
<https://reviews.apache.org/r/41821/#comment176983>

    It looks like this tests that the value written is the result of `getNanoTime(Timestamp,Calendar).toBinary()`,
but doesn't test what that value is or should be. That is probably the intent, but I wanted
to make sure.



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java (line 214)
<https://reviews.apache.org/r/41821/#comment176985>

    This is a great test for round trip, but I think we should also have a test with a set
of corrupt values from a file written with 5.5 in America/Los_Angeles (since that's convenient).


- Ryan Blue


On Jan. 13, 2016, 12:36 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41821/
> -----------------------------------------------------------
> 
> (Updated Jan. 13, 2016, 12:36 p.m.)
> 
> 
> Review request for hive, Ryan Blue, Mohammad Islam, Reuben Kuhnert, and Szehon Ho.
> 
> 
> Bugs: HIVE-12767
>     https://issues.apache.org/jira/browse/HIVE-12767
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The following exit criteria is addressed in this patch:
> 
> * Hive will read Parquet MR int96 timestamp data and adjust values using a time zone
from a table property, if set, or using the local time zone if it is absent. No adjustment
will be applied to data written by Impala.
> 
> * Hive will write Parquet int96 timestamps using a time zone adjustment from the same
table property, if set, or using the local time zone if it is absent. This keeps the data
in the table consistent.
> 
> * New tables created by Hive will set the table property to UTC if the global option
to set the property for new tables is enabled.
>   * Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the property
unless the global setting to do so is enabled.
>   * Tables created using CREATE TABLE LIKE <OTHER TABLE> will copy the property
of the table that is copied.
> 
> To set the timezone table property, use this:
>   create table tbl1 (ts timestamp) stored as parquet tblproperties ('parquet.mr.int96.write.zone'='PST');
>   
> To set UTC as default timezone table property on new tables created, use this: 
>   set parquet.mr.int96.enable.utc.write.zone=true;
>   create table tbl2 (ts timestamp) stored as parquet;
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1bcdc5f49e1a4a0f357842e88cf5fd359685b5ef

>   data/files/impala_int96_timestamp.parq PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java deec8bba45c130c5dfdc482522c0825a71af9d2c

>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java bfb48a987ce89a373f3da63c9162546c6eda43a9

>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java ec0dd818f688ab92feb46be4fb6040ede5ac756a

>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
53f3b72b790d87a75a7cd1d77d8f011c29c41188 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java
74a1a82047613189678716f765bfaa9ac39b7618 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java PRE-CREATION

>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java aace48ee7d145d199163286d21e4ee7694140d6f

>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java
f4621e5dbb81e8d58c4572c901ec9d1a7ca8c012 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 69272dc41dbc5fe29ab4c98e730b591c28f3a297

>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 70491390ba2b90f32ef9963be7b19e57672241f3

>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java PRE-CREATION

>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
ec6def5b9ac5f12e6a7cb24c4f4998a6ca6b4a8e 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/41821/diff/
> 
> 
> Testing
> -------
> 
> Added unit and q-tests:
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>

