hive-issues mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-21002) Backwards incompatible change: Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x incorrectly
Date Fri, 11 Jan 2019 22:54:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740856#comment-16740856
] 

Owen O'Malley edited comment on HIVE-21002 at 1/11/19 10:53 PM:
----------------------------------------------------------------

The behavior of Avro and Parquet is wrong both in 2.x and 3.1. The path forward should be
to match the desired Hive semantics and return '00:00:00' for new files, regardless of format.

Iceberg uses Parquet's isAdjustedToUTC = true for timestamptz, which is the equivalent of
Hive's timestamp with local time zone and isAdjustedToUTC = false for timestamp. It would
be good to match those semantics in Hive. Can we detect the version of Hive that wrote the
Parquet file to provide compatibility with old files?
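
The two Parquet annotations mentioned above can be illustrated with a small sketch. It is not Hive, Iceberg, or Parquet code: `render` is a hypothetical helper, and the stored value is modelled as a naive Python `datetime` purely for illustration. It assumes that isAdjustedToUTC = true means the stored value is a UTC instant displayed in the reader's zone, and that isAdjustedToUTC = false means the stored value is a wall-clock reading shown unchanged everywhere.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

FMT = "%Y-%m-%d %H:%M:%S"


def render(stored: datetime, is_adjusted_to_utc: bool, reader_zone: ZoneInfo) -> str:
    """Hypothetical helper: format a raw stored value for a reader in reader_zone.

    `stored` is a naive datetime standing in for the value kept in the file.
    """
    if is_adjusted_to_utc:
        # Instant semantics (timestamptz / timestamp with local time zone):
        # the stored value is UTC and is shifted into the reader's zone.
        return stored.replace(tzinfo=timezone.utc).astimezone(reader_zone).strftime(FMT)
    # Local semantics (plain timestamp): the wall-clock reading is returned
    # unchanged, whatever the reader's zone is.
    return stored.strftime(FMT)


paris = ZoneInfo("Europe/Paris")
as_instant = render(datetime(2018, 1, 1, 8, 0, 0), True, paris)
as_local = render(datetime(2018, 1, 1, 0, 0, 0), False, paris)
print(as_instant)
print(as_local)
```

Under these assumptions only the isAdjustedToUTC = false branch returns '00:00:00' for every reader, which is the behaviour asked for above for plain timestamp columns.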


was (Author: owen.omalley):
The behavior of Avro and Parquet is wrong both in 2.x and 3.1. The path forward should be
to match the desired Hive semantics and return '00:00:00' for new files, regardless of format.

Iceberg uses Parquet's isAdjustedToUTC = true for timestamptz, which is the equivalent of
Hive's timestamp with local time zone and isAdjustedToUTC = false for timestamp. It would
be good to match those semantics in Hive. Can we detect the version of Hive that wrote the
Parquet file to provide compatibility with told files?

> Backwards incompatible change: Hive 3.1 reads back Avro and Parquet timestamps written
by Hive 2.x incorrectly
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-21002
>                 URL: https://issues.apache.org/jira/browse/HIVE-21002
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.0, 3.1.1
>            Reporter: Zoltan Ivanfi
>            Priority: Major
>
> Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x incorrectly. As an
example session to demonstrate this problem, create a dataset using Hive version 2.x in America/Los_Angeles:
> {code:sql}
> hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
> hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
> {code}
> Querying this table by issuing
> {code:sql}
> hive> select * from ts_‹format›;
> {code}
> from different time zones using different versions of Hive and different storage formats
gives the following results:
> |‹format›|Time zone|Hive 2.x|Hive 3.1|
> |Avro and Parquet|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
> |Avro and Parquet|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
> |Textfile and ORC|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> |Textfile and ORC|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> *Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored in Avro
and Parquet formats.* Apache ORC behaviour has not changed because it was modified to adjust
timestamps to retain backwards compatibility. Textfile behaviour has not changed, because
its processing involves parsing and formatting instead of proper serializing and deserializing,
so it inherently had LocalDateTime semantics even in Hive 2.x.
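
The Avro and Parquet rows of the table can be reproduced with a short sketch. It is a model of the observed behaviour, not Hive's actual code path. It assumes that the Hive 2.x writer interprets the literal in its own zone and stores the corresponding UTC instant, that a Hive 2.x reader converts that instant back into its own zone, and that a Hive 3.1 reader shows the stored value as-is. The function names are hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

FMT = "%Y-%m-%d %H:%M:%S"


def stored_instant(wall_clock: str, writer_zone: ZoneInfo) -> datetime:
    """Assumed Hive 2.x write path: local wall clock -> UTC instant."""
    local = datetime.strptime(wall_clock, FMT).replace(tzinfo=writer_zone)
    return local.astimezone(timezone.utc)


def read_hive2(instant: datetime, reader_zone: ZoneInfo) -> str:
    """Assumed Hive 2.x read path: UTC instant -> reader's local wall clock."""
    return instant.astimezone(reader_zone).strftime(FMT)


def read_hive31(instant: datetime) -> str:
    """Assumed Hive 3.1 read path: stored UTC value shown without adjustment."""
    return instant.astimezone(timezone.utc).strftime(FMT)


instant = stored_instant("2018-01-01 00:00:00", ZoneInfo("America/Los_Angeles"))
for zone in ("America/Los_Angeles", "Europe/Paris"):
    print(zone, read_hive2(instant, ZoneInfo(zone)), read_hive31(instant))
```

In this model the Hive 2.x column depends on the reader's zone (00:00 in Los Angeles, 09:00 in Paris), while the Hive 3.1 column is 08:00 for both readers, matching the table.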



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
