hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Remus Rusanu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-5527) Use of localtime Calendar in vectorized Timestamp arithmetic results in data corruption (depends on localtime)
Date Mon, 14 Oct 2013 14:41:42 GMT

     [ https://issues.apache.org/jira/browse/HIVE-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Remus Rusanu updated HIVE-5527:
-------------------------------

    Description: 
A query like 

SELECT ctimestamp2 from alltypesorc WHERE ctimestamp2 > -10669; 

returns rows in row mode, but not in vector mode *when running in GMT+2 timezone*. 

I know what causes this, but I don’t know exactly whether is a bug or not.
The reading of the TIMESTAMP types is done in TimeStampTreeReader class, 

long ms = (result.vector[result.isRepeating ? 0 : i] + WriterImpl.BASE_TIMESTAMP)
              * WriterImpl.MILLIS_PER_SECOND;
          long ns = parseNanos(nanoVector.vector[nanoVector.isRepeating ? 0 : i]);
          // the rounding error exists because java always rounds up when dividing integers
          // -42001/1000 = -42; and -42001 % 1000 = -1 (+ 1000)
          // to get the correct value we need
          // (-42 - 1)*1000 + 999 = -42001
          // (42)*1000 + 1 = 42001
          if(ms < 0 && ns != 0) {
            ms -= 1000;
          }
          // Convert millis into nanos and add the nano vector value to it
          result.vector[i] = (ms * 1000000) + ns;

As you see this relies on the ORC WriterImpl.BASE_TIMESTAMP, which is declared as:

  static final long BASE_TIMESTAMP =
      Timestamp.valueOf("2015-01-01 00:00:00").getTime() / MILLIS_PER_SECOND;

On US/Pacific time, this will be 1420099200
On EEST (GMT+2) time is 1420063200

The first row in alltypesorc for ctimestamp2 reads -1420099192 as data[0] and 7005 as nanos[0].
On US/Pacific, with a LONG vector timestamp value of 8875000000. On EEST it ends up with -35992125000000.
(Note how the abs(data[0]) value is smaller than the US/Pacific basetime, but bigger than
the EEST, so it goes negative on EEST and just cascades to a huge negative number).

The vector filter simply compares this with -10669 (the query WHERE clause) and it qualifies
the row on US/Pacific, but fails on EEST.

I’m not sure what the right solution is, the whole of Hive code appears to be riddled with
Timezone problems. As a side node, the build-common.xml  sets an environment variable TZ to
US/Pacific, but this has no effect in running tests on Windows. 

But the gist of it is this: in row mode the results are consistent on any time zone. In vector
mode the results vary (rows qualify for WHERE clause) depending on the timezone.


  was:I did not yet identify the root cause, but the vectorization_regress.q returns different
results depending on the local timezone settings


> Use of localtime Calendar in vectorized Timestamp arithmetic results in data corruption
(depends on localtime)
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5527
>                 URL: https://issues.apache.org/jira/browse/HIVE-5527
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Remus Rusanu
>
> A query like 
> SELECT ctimestamp2 from alltypesorc WHERE ctimestamp2 > -10669; 
> returns rows in row mode, but not in vector mode *when running in GMT+2 timezone*. 
> I know what causes this, but I don’t know exactly whether is a bug or not.
> The reading of the TIMESTAMP types is done in TimeStampTreeReader class, 
> long ms = (result.vector[result.isRepeating ? 0 : i] + WriterImpl.BASE_TIMESTAMP)
>               * WriterImpl.MILLIS_PER_SECOND;
>           long ns = parseNanos(nanoVector.vector[nanoVector.isRepeating ? 0 : i]);
>           // the rounding error exists because java always rounds up when dividing integers
>           // -42001/1000 = -42; and -42001 % 1000 = -1 (+ 1000)
>           // to get the correct value we need
>           // (-42 - 1)*1000 + 999 = -42001
>           // (42)*1000 + 1 = 42001
>           if(ms < 0 && ns != 0) {
>             ms -= 1000;
>           }
>           // Convert millis into nanos and add the nano vector value to it
>           result.vector[i] = (ms * 1000000) + ns;
> As you see this relies on the ORC WriterImpl.BASE_TIMESTAMP, which is declared as:
>   static final long BASE_TIMESTAMP =
>       Timestamp.valueOf("2015-01-01 00:00:00").getTime() / MILLIS_PER_SECOND;
> On US/Pacific time, this will be 1420099200
> On EEST (GMT+2) time is 1420063200
> The first row in alltypesorc for ctimestamp2 reads -1420099192 as data[0] and 7005 as
nanos[0]. On US/Pacific, with a LONG vector timestamp value of 8875000000. On EEST it ends
up with -35992125000000. (Note how the abs(data[0]) value is smaller than the US/Pacific basetime,
but bigger than the EEST, so it goes negative on EEST and just cascades to a huge negative
number).
> The vector filter simply compares this with -10669 (the query WHERE clause) and it qualifies
the row on US/Pacific, but fails on EEST.
> I’m not sure what the right solution is, the whole of Hive code appears to be riddled
with Timezone problems. As a side node, the build-common.xml  sets an environment variable
TZ to US/Pacific, but this has no effect in running tests on Windows. 
> But the gist of it is this: in row mode the results are consistent on any time zone.
In vector mode the results vary (rows qualify for WHERE clause) depending on the timezone.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Mime
View raw message