orc-dev mailing list archives

From "Rei Mai (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ORC-451) Timestamp statistics is wrong if read with useUTCTimestamp=true
Date Mon, 24 Dec 2018 19:27:00 GMT
Rei Mai created ORC-451:

             Summary: Timestamp statistics is wrong if read with useUTCTimestamp=true
                 Key: ORC-451
                 URL: https://issues.apache.org/jira/browse/ORC-451
             Project: ORC
          Issue Type: Bug
    Affects Versions: 1.5.0
         Environment: timezone for both client and server "Europe/Moscow" (UTC+3)
hive version
            Reporter: Rei Mai
         Attachments: 000000_0

We're using external ORC tables and the timezone "Europe/Moscow" (UTC+3) for both client and
server. After switching to Hive 3, which uses ORC 1.5.x, we've hit an issue with predicate
pushdown filtering out matching stripes by timestamp. E.g. consider a table (its ORC data is
in the attachment):
{quote}{{create external table test_ts (ts timestamp) stored as orc;}}

{{insert into test_ts values ("2018-12-24 18:30:00");}}

{{// No rows selected}}

{{select * from test_ts where ts < "2018-12-24 19:00:00";}}

{{// the lowest filter to return the value}}

{{select * from test_ts where ts <= "2018-12-24 21:30:00";}}
{quote}
The issue only affects the statistics of external ORC tables. Turning PPD off with _set hive.optimize.index.filter=false;_ works around it.

We believe it was introduced by https://jira.apache.org/jira/browse/ORC-341.

The UTC conversion in org.apache.orc.impl.SerializationUtils looks rather strange:
{quote}public static long convertToUtc(TimeZone local, long time) {
   int offset = local.getOffset(time);
   return time + offset;
}{quote}
This adds a 3-hour offset to our timestamp in the UTC+3 timezone (shouldn't it subtract 3 hours?).
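
To illustrate the arithmetic, here is a minimal standalone sketch. It copies the method body quoted above into a hypothetical class (ConvertToUtcDemo is not part of ORC) and applies it to the example value from the table:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.util.TimeZone;

// Hypothetical demo class, not part of ORC.
class ConvertToUtcDemo {
    // Same arithmetic as the convertToUtc method quoted in this report.
    public static long convertToUtc(TimeZone local, long time) {
        int offset = local.getOffset(time);
        return time + offset;
    }

    // Epoch millis of the example row: 2018-12-24 18:30:00 wall-clock time in Europe/Moscow.
    public static long exampleInstant() {
        return LocalDateTime.of(2018, 12, 24, 18, 30, 0)
                .atZone(ZoneId.of("Europe/Moscow"))
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        TimeZone moscow = TimeZone.getTimeZone("Europe/Moscow");
        long instant = exampleInstant();
        long shifted = convertToUtc(moscow, instant);
        System.out.println(Instant.ofEpochMilli(instant)); // 2018-12-24T15:30:00Z
        System.out.println(Instant.ofEpochMilli(shifted)); // 2018-12-24T18:30:00Z
    }
}
```

The shifted value is 3 hours later than the real instant; rendered in UTC+3 it reads 21:30:00, which matches the lowest filter value that returns the row.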

If org.apache.orc.impl.TimestampStatisticsImpl is used with useUTCTimestamp=false, the timestamp
is converted back in a compatible way via SerializationUtils.convertFromUtc. But Hive seems
to override the default org.apache.orc.OrcFile.ReaderOptions with org.apache.hadoop.hive.ql.io.orc.ReaderOptions,
which has useUTCTimestamp(true) in its constructor. With useUTCTimestamp=true, the evaluatePredicateProto
predicate uses TimestampStatisticsImpl.getMaximumUTC(), which returns the timestamp
as is, i.e. in the example it's "{{2018-12-24 21:30:00}} UTC+3".

At the same time the predicate is not shifted (the value in this tez log is in UTC+3):
{quote}2018-12-24 22:12:16,205 [INFO] [InputInitializer \{Map 1} #0] |orc.OrcInputFormat|:
ORC pushdown predicate: leaf-0 = (LESS_THAN ts 2018-12-24 19:30:00.0), expr = leaf-0
{quote}
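
The mismatch can be reproduced with a simplified model. This is only a sketch: the class and helper names are hypothetical, and the plain min comparison stands in for the real ORC predicate evaluation. It assumes the statistic is the stored value plus the local offset, while the predicate literal is left unshifted:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.util.TimeZone;

// Hypothetical model of the comparison, not ORC's actual evaluation code.
class PpdMismatchDemo {
    static final ZoneId MOSCOW = ZoneId.of("Europe/Moscow");

    // Epoch millis for a wall-clock time in Europe/Moscow, e.g. "2018-12-24T18:30:00".
    public static long moscowMillis(String isoLocal) {
        return LocalDateTime.parse(isoLocal).atZone(MOSCOW).toInstant().toEpochMilli();
    }

    // Assumed statistic: the stored value plus the local offset (time + offset).
    public static long shiftedStatistic(long time) {
        return time + TimeZone.getTimeZone(MOSCOW).getOffset(time);
    }

    // A stripe can contain rows with ts < literal only if its minimum is below the literal.
    public static boolean mayMatchLessThan(long statMin, long literal) {
        return statMin < literal;
    }

    // A stripe can contain rows with ts <= literal only if its minimum is at or below the literal.
    public static boolean mayMatchLessThanEquals(long statMin, long literal) {
        return statMin <= literal;
    }

    public static void main(String[] args) {
        long stat = shiftedStatistic(moscowMillis("2018-12-24T18:30:00"));
        // ts < "2018-12-24 19:00:00": the stripe is skipped although the row matches.
        System.out.println(mayMatchLessThan(stat, moscowMillis("2018-12-24T19:00:00"))); // false
        // ts <= "2018-12-24 21:30:00": the first literal for which the stripe is read.
        System.out.println(mayMatchLessThanEquals(stat, moscowMillis("2018-12-24T21:30:00"))); // true
    }
}
```

Under these assumptions the single-row stripe is only read once the literal reaches 21:30:00, exactly 3 hours after the stored value.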

