drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chunhui Shi <c...@mapr.com>
Subject Re: Drill date & time types encoding
Date Fri, 17 Mar 2017 01:23:44 GMT
I think they are using the same timezone data from IANA,

For Java the timezone data can be found under jre/lib/zi, and Oracle has a timezone update
tool too.
For enterprise software vendors, timezone update actually is a big thing.
________________________________
From: Boaz Ben-Zvi <bben-zvi@mapr.com>
Sent: Thursday, March 16, 2017 6:08:01 PM
To: dev@drill.apache.org
Subject: Re: Drill date & time types encoding

  Timezone calculations are not simple ( e.g.,  “2017-03-11 23:30:00-PST” + INTERVAL ‘3’
HOURS  --> need to know about daylight savings time, etc.)



  Linux does have a timezone. The actual implementation is quite complex – it keeps an elaborate
“database” under /usr/share/zoneinfo , (which needs to be updated periodically, e.g. by
running “yum update tzdata”).



Is Java’s TZ support (https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html)
equivalent to Linux ?



-       Boaz



On 3/16/17, 4:48 PM, "Paul Rogers" <progers@mapr.com> wrote:



    Thanks all for the explanations!



    Did a bit of poking around. See DRILL-5360. For the Timestamp type:



    * Literals are claimed to be in UTC (have not yet tested)

    * Value vectors store Timestamps in server local time

    * Drill clients get the Timestamp in server local time

    * JDBC clients try to convert server local time to UTC, but use the client timezone to
do so.



    The result is that clients must know the server timezone, but Drill does not provide this
info. Drill clients must convert from server timezone to UTC to get the UTC value of a timestamp.



    JDBC clients must convert from the “UTC” given from JDBC to true UTC by subtracting
the difference between server and client timezone offsets.



    I suspect, as Jinfeng points out, that much of the confusion comes from the conflicting
use of the term “timestamp” in the SQL 2011 standard [1] and standard Linux/Java practice.



    In Linux and Java, a “timestamp” is ms since the Unix epoch, UTC. (That is, the UTC
timestamp is implied, so all machines anywhere agree on what a time means.)



    The SQL TIMESTAMP is what most databases call a DATETIME: a combination of a date and
time that are “free floating”: there is no implied time zone. “3 PM” is just that,
it does not imply “3 PM in Paris.”



    SQL provides a TIMESTAMP WITH TIME ZONE, but that also differs from Linux practice: it
is not a UTC time but rather a DATETIME with a associated timezone.



    The Drill Timestamp is neither of these. It like a TIMESTAMP WITH TIMEZONE where the timezone
is the server local timezone. But, Drill does not specify that timezone, so the client “just
has to know.” Unlike the Linux timestamp, the client & server don’t agree ahead of
time by convention; instead every server can have its own Timestamp timezone and the client
must figure out the corresponding UTC or client local time.



    What we have can work with clever adjustment programming. But it would be better (for
wider adoption) to provide a cleaner, more deterministic API.



    Unfortunately, we probably can’t fix the existing Timestamp as there is probably already
code that tries (like JDBC) to work around the current behavior.



    Instead, we should add the SQL TIMESTAMP WITH TIMEZONE. Or add a non-standard “LinuxTimezone”
(or “TimezoneUTC”) that sores times in an agreed-upon UTC format.



    Until then, tread carefully.



    - Paul



    [1] http://standards.iso.org/ittf/PubliclyAvailableStandards/c053681_ISO_IEC_9075-1_2011.zip



    > On Mar 16, 2017, at 4:25 PM, Jinfeng Ni <jni@apache.org> wrote:

    >

    > My understanding is TIME/TIMESTAMP in Drill is TIME/TIMESTAMP without

    > timezone. TimeStampTZ is for TIMESTAMP with timezone, which Drill

    > probably does not fully support.

    >

    > SQL standards has  DATE, TIME WITHOUT TIME ZONE, TIMESTAMP WITHOUT

    > TIME ZONE, TIME WITH TIME ZONE, or TIMESTAMP WITH TIME ZONE.

    > Time/Timestamp without t/z should be interpreted as local time.

    >

    > Here is some descriptions in SQL 2011 : Sec 4.6.2.

    >

    > "

    >

    > A datetime data type that specifies WITH TIME ZONE is a data type that

    > is datetime with time zone, while a datetime data type that specifies

    > WITHOUT TIME ZONE is a data type that is datetime without time zone.

    >

    > The surface of the earth is divided into zones, called time zones, in

    > which every correct clock tells the same time, known as local time.

    > Local time is equal to UTC (Coordinated Universal Time) plus the time

    > zone dis- placement, which is an interval value that ranges between

    > INTERVAL '–14:00' HOUR TO MINUTE and INTERVAL '+14:00' HOUR TO MINUTE.

    >

    > A datetime value, of data type TIME WITHOUT TIME ZONE or TIMESTAMP

    > WITHOUT TIME ZONE, may represent a local time, whereas a datetime

    > value of data type TIME WITH TIME ZONE or TIMESTAMP WITH TIME ZONE

    > represents UTC.

    >

    > "

    >

    > On Tue, Mar 14, 2017 at 11:56 AM, Julian Hyde <jhyde@apache.org> wrote:

    >> I don’t think 4713 BC comes from the SQL standard. That is a Postgres thing.

    >>

    >> I believe that the standard says you should support timestamp precision up to
9 (i.e. nanoseconds). 2 ^ 64 nanoseconds is 584 years. So, it’s not possible to cram all
of the timestamp values we’d like into a 64 bit integer if you are going to support such
a large time range.

    >>

    >> Julian

    >>

    >>> On Mar 14, 2017, at 11:29 AM, Boaz Ben-Zvi <bben-zvi@mapr.com> wrote:

    >>>

    >>> Thanks for the detailed research, Paul,

    >>>

    >>> INTERVAL by the (SQL-99 ?) standard should be either a SECOND-DAY interval
or a MONTH-YEAR interval. This is a result of the inconsistent “number of days in a month”.

    >>>

    >>> DATE expressed in days, starting at 4713-BC sound like the standard as well
(at least same as Postgres). (should be implemented as 4 byte)

    >>>

    >>> TIME is only within a 24 hour period, so why should it care about 2001 ?
 Probably the documentation should be fixed. (should be implemented as 4 byte)

    >>>

    >>> TIMESTAMP should be an 8 byte type (maybe starting at 4713-BC as well, to
match DATE (and Postgres :-)).

    >>>

    >>>   Thanks,

    >>>

    >>>          — Boaz

    >>>

    >>>

    >>> On Mar 13, 2017, at 3:46 PM, Paul Rogers <progers@mapr.com<mailto:progers@mapr.com>>
wrote:

    >>>

    >>> Thanks Parth!

    >>>

    >>> The date and time definitions are the “classic” ones, but conflict with
the Drill documentation:

    >>>

    >>> http://drill.apache.org/docs/supported-data-types/

    >>>

    >>> DATE Years, months, and days in YYYY-MM-DD format since 4713 BC

    >>>

    >>> TIME 24-hour based time before or after January 1, 2001 in hours, minutes,
seconds format: HH:mm:ss

   >>>

    >>> Which is correct?

    >>>

    >>> If the documentation is wrong, we can file a JIRA to correct it. (It may
not even be wrong, since one can convert from one to the other easily, it may just be misleading…)

    >>>

    >>> Also note that, according to C++, DATE and TIME and TIMESTAMP are exactly
the same, but the TIME as as 32-bit number, could only hold about 2 years due to limited range.

    >>>

    >>> Also, according to SQL, DATE has no time zone, it is just a date. That is,
2016-03-13 is the same date in PST or GMT. If DATE were seconds since the UTC epoch, dates
would be different in different time zones. So, I assume we use the Unix epoch, but without
an implied UTC time zone as is usual for Linux and Windows timestamps?

    >>>

    >>> How does a TIMESTAMP differ from a DATE? Perhaps a TIMESTAMP is based on
the epoch UTC while DATE has no implied time zone?

    >>>

    >>> Again, the documentation differs:

    >>>

    >>> INTERVAL (Internally, INTERVAL is represented as INTERVALDAY or INTERVALYEAR.)
A day-time or year-month interval

    >>>

    >>> TIMESTAMP JDBC timestamp in year, month, date hour, minute, second, and optional
milliseconds format: yyyy-MM-dd HH:mm:ss.SSS

    >>>

    >>> So, sounds like we have an INTERVALDAY and INTERVAL year, but do we or do
we not have an INTERVAL?

    >>>

    >>> If anyone knows, please let me know, else I need to do some poking around...

    >>>

    >>> Thanks,

    >>>

    >>> - Paul

    >>>

    >>> On Mar 13, 2017, at 2:44 PM, Parth Chandra <parthc@apache.org<mailto:parthc@apache.org>>
wrote:

    >>>

    >>> Paul asked this and I'm posting here so someone who knows better can

    >>> correct me if I'm wrong ( This is from my notes when I was young)

    >>>

    >>> DATE : Int64 : Milliseconds from Unix Epoch : 1/1/1970 00:00:00

    >>> TIME : Int32 : Milliseconds from midnight on 1/1/1970

    >>> TimeStampTZ : Int64 + Int32 : (Milliseconds from epoch + Index into list
of

    >>> TimeZones)

    >>> TimeStamp : Int64 : Milliseconds from epoch

    >>> Interval : Int32 + Int32 + Int32 : Month + Days + Milliseconds

    >>> Interval Day : Int32 + Int32 : Days + Milliseconds

    >>> Interval Year : Int32 : Month

    >>>

    >>> A slightly readable version of these can be found in the C++ client :).

    >>> $drill_src/contrib/native/client/src/include/drill/recordbatch.hpp which

    >>> has a bunch of 'Holder' structs for the date-time types.

    >>>

    >>> HTH

    >>>

    >>> Parth

    >>>

    >>>

    >>





Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message