drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From paul-rogers <...@git.apache.org>
Subject [GitHub] drill issue #937: DRILL-5002: Using hive's date functions on top of date col...
Date Wed, 13 Sep 2017 18:24:20 GMT
Github user paul-rogers commented on the issue:

    The original description talks about data with local times. The TPC-H data has no TZ.
Now, maybe we made one up in creating the Parquet files, but the original date just has dates
without a tz.
    The fundamental issue is that if we have a tz-less date, 1994-08-12, say, then this *cannot*
be converted to a UTC timestamp. Which of the 23+ time zones would we use? How would the client
and server agree on the arbitrary tz? This is like saying that I have a measurement in miles,
but we can store distances only in km, so I'll take my length of 5 miles and store it as 5
km, remembering that I'm using km as an alias for miles. Does not make sense.
    Your example uses `timestamp` constants. A timestamp is defined with a timezone, and so
it fits Drill's model well. But, TPC dates don't have a timezone. See [the TPC-H spec](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.2.pdf)
which says:
    > Date is a value whose external representation can be expressed as YYYY-MM-DD, where
all characters are numeric. A date must be able to express any day within at least 14 consecutive
years. There is no requirement specific to the internal representation of a date.
    That is, TPC-H dates are not midnight on some date in some timezone, they are just dates.
The cannot be converted to UTC. And so, they should not be subject to time zone shifting as
tzs shift.
    My point here is that Hive (according to the docs) implements functions correctly: using
tz-less dates. Drill tries to convert to a (fake) UTC and use time-based functions on that
data. This is, at best, a hack, and at worst, leads to great complexity and incorrect results.
    That said, if all we have is km, and we can't do the miles-to-km conversion correctly,
then we do need a way to know that a particular km value is actually miles. Similarly, using
the current implementation, how will we know that a particular arbitrary-local-time-encoded-as-fake-UTC
value really is local time vs. being an actual UTC time?
    All that said, if you fix makes the current implementation work better, then it is a good
    In the interests of moving ahead, let's table the basic discussion and just look at this
one fix.


View raw message