drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5002) Using hive's date functions on top of date column gives wrong results for local time-zone
Date Wed, 13 Sep 2017 18:25:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165075#comment-16165075
] 

ASF GitHub Bot commented on DRILL-5002:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/937
  
    The original description talks about data with local times. The TPC-H data has no TZ.
Now, maybe we made one up in creating the Parquet files, but the original date just has dates
without a tz.
    
    The fundamental issue is that if we have a tz-less date, 1994-08-12, say, then this *cannot*
be converted to a UTC timestamp. Which of the 23+ time zones would we use? How would the client
and server agree on the arbitrary tz? This is like saying that I have a measurement in miles,
but we can store distances only in km, so I'll take my length of 5 miles and store it as 5
km, remembering that I'm using km as an alias for miles. Does not make sense.
    
    Your example uses `timestamp` constants. A timestamp is defined with a timezone, and so
it fits Drill's model well. But, TPC dates don't have a timezone. See [the TPC-H spec](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.2.pdf)
which says:
    
    > Date is a value whose external representation can be expressed as YYYY-MM-DD, where
all characters are numeric. A date must be able to express any day within at least 14 consecutive
years. There is no requirement specific to the internal representation of a date.
    
    That is, TPC-H dates are not midnight on some date in some timezone, they are just dates.
The cannot be converted to UTC. And so, they should not be subject to time zone shifting as
tzs shift.
    
    My point here is that Hive (according to the docs) implements functions correctly: using
tz-less dates. Drill tries to convert to a (fake) UTC and use time-based functions on that
data. This is, at best, a hack, and at worst, leads to great complexity and incorrect results.
    
    That said, if all we have is km, and we can't do the miles-to-km conversion correctly,
then we do need a way to know that a particular km value is actually miles. Similarly, using
the current implementation, how will we know that a particular arbitrary-local-time-encoded-as-fake-UTC
value really is local time vs. being an actual UTC time?
    
    All that said, if you fix makes the current implementation work better, then it is a good
improvement.
    
    In the interests of moving ahead, let's table the basic discussion and just look at this
one fix.


> Using hive's date functions on top of date column gives wrong results for local time-zone
> -----------------------------------------------------------------------------------------
>
>                 Key: DRILL-5002
>                 URL: https://issues.apache.org/jira/browse/DRILL-5002
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Hive, Storage - Parquet
>            Reporter: Rahul Challapalli
>            Assignee: Vitalii Diravka
>            Priority: Critical
>         Attachments: 0_0_0.parquet
>
>
> git.commit.id.abbrev=190d5d4
> Wrong Result 1 :
> {code}
> select l_shipdate, `month`(l_shipdate) from cp.`tpch/lineitem.parquet` where l_shipdate
= date '1994-02-01' limit 2;
> +-------------+---------+
> | l_shipdate  | EXPR$1  |
> +-------------+---------+
> | 1994-02-01  | 1       |
> | 1994-02-01  | 1       |
> +-------------+---------+
> {code}
> Wrong Result 2 : 
> {code}
> select l_shipdate, `day`(l_shipdate) from cp.`tpch/lineitem.parquet` where l_shipdate
= date '1998-06-02' limit 2;
> +-------------+---------+
> | l_shipdate  | EXPR$1  |
> +-------------+---------+
> | 1998-06-02  | 1       |
> | 1998-06-02  | 1       |
> +-------------+---------+
> {code}
> Correct Result :
> {code}
> select l_shipdate, `month`(l_shipdate) from cp.`tpch/lineitem.parquet` where l_shipdate
= date '1998-06-02' limit 2;
> +-------------+---------+
> | l_shipdate  | EXPR$1  |
> +-------------+---------+
> | 1998-06-02  | 6       |
> | 1998-06-02  | 6       |
> +-------------+---------+
> {code}
> It looks like we are getting wrong results when the 'day' is '01'. I only tried month
and day hive functions....but wouldn't be surprised if they have similar issues too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message