Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id B59FB200CF3 for ; Wed, 13 Sep 2017 20:25:07 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id B463A1609CB; Wed, 13 Sep 2017 18:25:07 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 0E5A21609C3 for ; Wed, 13 Sep 2017 20:25:06 +0200 (CEST) Received: (qmail 10963 invoked by uid 500); 13 Sep 2017 18:25:05 -0000 Mailing-List: contact issues-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list issues@drill.apache.org Received: (qmail 10954 invoked by uid 99); 13 Sep 2017 18:25:05 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 13 Sep 2017 18:25:05 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id C68A5CA665 for ; Wed, 13 Sep 2017 18:25:04 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.201 X-Spam-Level: X-Spam-Status: No, score=-99.201 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id fp4FvBgwSONh for ; Wed, 13 Sep 2017 18:25:04 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id A7E745FD7D for ; Wed, 13 Sep 2017 18:25:03 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 63F6DE0EAC for ; Wed, 13 Sep 2017 18:25:02 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 536C325387 for ; Wed, 13 Sep 2017 18:25:00 +0000 (UTC) Date: Wed, 13 Sep 2017 18:25:00 +0000 (UTC) From: "ASF GitHub Bot (JIRA)" To: issues@drill.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (DRILL-5002) Using hive's date functions on top of date column gives wrong results for local time-zone MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Wed, 13 Sep 2017 18:25:07 -0000 [ https://issues.apache.org/jira/browse/DRILL-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165075#comment-16165075 ] ASF GitHub Bot commented on DRILL-5002: --------------------------------------- Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/937 The original description talks about data with local times. The TPC-H data has no TZ. Now, maybe we made one up in creating the Parquet files, but the original date just has dates without a tz. The fundamental issue is that if we have a tz-less date, 1994-08-12, say, then this *cannot* be converted to a UTC timestamp. Which of the 23+ time zones would we use? How would the client and server agree on the arbitrary tz? This is like saying that I have a measurement in miles, but we can store distances only in km, so I'll take my length of 5 miles and store it as 5 km, remembering that I'm using km as an alias for miles. Does not make sense. Your example uses `timestamp` constants. A timestamp is defined with a timezone, and so it fits Drill's model well. But, TPC dates don't have a timezone. See [the TPC-H spec](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.2.pdf) which says: > Date is a value whose external representation can be expressed as YYYY-MM-DD, where all characters are numeric. A date must be able to express any day within at least 14 consecutive years. There is no requirement specific to the internal representation of a date. That is, TPC-H dates are not midnight on some date in some timezone, they are just dates. The cannot be converted to UTC. And so, they should not be subject to time zone shifting as tzs shift. My point here is that Hive (according to the docs) implements functions correctly: using tz-less dates. Drill tries to convert to a (fake) UTC and use time-based functions on that data. This is, at best, a hack, and at worst, leads to great complexity and incorrect results. That said, if all we have is km, and we can't do the miles-to-km conversion correctly, then we do need a way to know that a particular km value is actually miles. Similarly, using the current implementation, how will we know that a particular arbitrary-local-time-encoded-as-fake-UTC value really is local time vs. being an actual UTC time? All that said, if you fix makes the current implementation work better, then it is a good improvement. In the interests of moving ahead, let's table the basic discussion and just look at this one fix. > Using hive's date functions on top of date column gives wrong results for local time-zone > ----------------------------------------------------------------------------------------- > > Key: DRILL-5002 > URL: https://issues.apache.org/jira/browse/DRILL-5002 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Hive, Storage - Parquet > Reporter: Rahul Challapalli > Assignee: Vitalii Diravka > Priority: Critical > Attachments: 0_0_0.parquet > > > git.commit.id.abbrev=190d5d4 > Wrong Result 1 : > {code} > select l_shipdate, `month`(l_shipdate) from cp.`tpch/lineitem.parquet` where l_shipdate = date '1994-02-01' limit 2; > +-------------+---------+ > | l_shipdate | EXPR$1 | > +-------------+---------+ > | 1994-02-01 | 1 | > | 1994-02-01 | 1 | > +-------------+---------+ > {code} > Wrong Result 2 : > {code} > select l_shipdate, `day`(l_shipdate) from cp.`tpch/lineitem.parquet` where l_shipdate = date '1998-06-02' limit 2; > +-------------+---------+ > | l_shipdate | EXPR$1 | > +-------------+---------+ > | 1998-06-02 | 1 | > | 1998-06-02 | 1 | > +-------------+---------+ > {code} > Correct Result : > {code} > select l_shipdate, `month`(l_shipdate) from cp.`tpch/lineitem.parquet` where l_shipdate = date '1998-06-02' limit 2; > +-------------+---------+ > | l_shipdate | EXPR$1 | > +-------------+---------+ > | 1998-06-02 | 6 | > | 1998-06-02 | 6 | > +-------------+---------+ > {code} > It looks like we are getting wrong results when the 'day' is '01'. I only tried month and day hive functions....but wouldn't be surprised if they have similar issues too. -- This message was sent by Atlassian JIRA (v6.4.14#64029)