MaxGekk commented on a change in pull request #25981: [SPARK-28420][SQL] Support the `INTERVAL` type in `date_part()`
URL: https://github.com/apache/spark/pull/25981#discussion_r332571726

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
##########

@@ -2067,6 +2082,10 @@ object DatePart {
        224
       > SELECT _FUNC_('SECONDS', timestamp'2019-10-01 00:00:01.000001');
        1.000001
 +     > SELECT _FUNC_('days', interval 1 year 10 months 5 days);

Review comment:

> Sometimes the entire interval's length is returned in the given unit ...

The entire interval's length is returned only for `EPOCH`; for all other `field` values, the corresponding component is returned.

> I wonder if there is any other standard to look at.

I like the Java 8 approach, which has two types: `java.time.Duration` and `java.time.Period`. The first stores an interval's duration with nanosecond precision; the second stores the components `years`, `months`, and `days`. I would reuse this model and extend the types slightly (see the sketches after the PostgreSQL example below):

- A `DURATION` type stores an interval's duration in microseconds. A `long` should be enough to hold the difference between any two supported timestamps.
- A `PERIOD` type stores the components `years`, `months`, `days`, `hours`, `minutes`, `seconds`, `milliseconds`, and `microseconds`. For example: (-10 years, 5 months, -3 hours, 100 microseconds).

> I'm mostly concerned with having consistent semantics, whatever they are. If it can't be done reasonably, hm, should we implement this? I don't know.

From my point of view, the semantics of each extraction field are well defined. There is, however, a difference between the implementation in this PR and PostgreSQL. As I wrote above, Spark stores an interval in two variables, `months` and `microseconds`, but PostgreSQL uses three: `month`, `day`, and `time` (in microseconds). In the PostgreSQL model, `days` are independent of the other components, and as a consequence `hours` are not limited to `[0, 24)`:

```sql
maxim=# SELECT interval '-100 years 100 days -100 hours';
            interval
---------------------------------
 -100 years +100 days -100:00:00
(1 row)
```
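To make the difference between the two storage models concrete, here is a minimal sketch (the case classes and field names are illustrative, not the actual internals of Spark's `org.apache.spark.unsafe.types.CalendarInterval` or of PostgreSQL):

```scala
// Illustrative models of the two interval representations discussed above.
case class SparkInterval(months: Int, microseconds: Long)
case class PgInterval(months: Int, days: Int, time: Long) // time in microseconds

object IntervalModels {
  private val MicrosPerHour: Long = 3600L * 1000 * 1000

  def main(args: Array[String]): Unit = {
    // interval '-100 years 100 days -100 hours' in the three-field model:
    val pg = PgInterval(months = -100 * 12, days = 100, time = -100 * MicrosPerHour)
    // HOURS is extracted from `time` alone, so it can be -100:
    println(pg.time / MicrosPerHour) // -100

    // In the two-field model, days must be folded into microseconds,
    // so days and hours can no longer be separated:
    val sp = SparkInterval(-100 * 12, (100L * 24 - 100) * MicrosPerHour)
    println(sp.microseconds / MicrosPerHour) // 2300
  }
}
```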
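And going back to the `DURATION`/`PERIOD` proposal above, the two types would look roughly like this (a sketch only; `SqlDuration` and `SqlPeriod` are hypothetical names, not existing Spark types):

```scala
// A DURATION is one totally ordered quantity:
case class SqlDuration(micros: Long) extends Ordered[SqlDuration] {
  def compare(that: SqlDuration): Int = this.micros.compare(that.micros)
}

// A PERIOD keeps the components separate; note it cannot be Ordered,
// because e.g. "1 month" vs "30 days" has no fixed answer.
// The (-10 years, 5 months, -3 hours, 100 microseconds) example above would be:
//   SqlPeriod(years = -10, months = 5, days = 0, hours = -3,
//             minutes = 0, seconds = 0, millis = 0, micros = 100)
case class SqlPeriod(years: Int, months: Int, days: Int, hours: Int,
                     minutes: Int, seconds: Int, millis: Int, micros: Int)
```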
> I'm trying to figure out the use case for "a date part of an interval". I can only think of cases where the interval should be converted entirely into some unit.

I can imagine at least two use cases:

- Adjusting some of a timestamp's or date's components. For example, we have `timestamp '2019-10-08 10:11:12.123456'` and want to adjust it by `-1 year 10 hours`; as a result, we get `timestamp '2018-10-08 20:11:12.123456'`. The `PERIOD` type could be useful here (see the first sketch below).
- When we need `Ordered` (and comparable) intervals, we can calculate an interval's absolute duration in some unit. For example, in the query:

```sql
SELECT name, place
FROM t1, t2
WHERE t1.sellTimestamp - t2.deliveredTimestamp < interval 1 month 15 days;
```

Here, the `DURATION` type should be used. We cannot use `PERIOD` because its values cannot be ordered, and Spark's `CalendarIntervalType` cannot be used here either (see the second sketch below).
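For the first use case, `java.time` already expresses the adjustment. Note that `java.time.Period` covers only the date part, so the hours are applied separately, whereas the proposed `PERIOD` type would carry both (a minimal, runnable sketch):

```scala
import java.time.{LocalDateTime, Period}

object AdjustTimestamp {
  def main(args: Array[String]): Unit = {
    val ts = LocalDateTime.parse("2019-10-08T10:11:12.123456")
    // Adjust by "-1 year 10 hours": the year via Period, the hours directly.
    val adjusted = ts.plus(Period.ofYears(-1)).plusHours(10)
    println(adjusted) // 2018-10-08T20:11:12.123456
  }
}
```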
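For the second use case, the relevant property of this model is that `java.time.Duration` implements `Comparable` while `java.time.Period` does not, which is exactly the `DURATION` vs `PERIOD` distinction. A sketch, where the 45-day limit stands in for `interval 1 month 15 days` under an assumed 30-day month:

```scala
import java.time.Duration

object OrderedIntervals {
  def main(args: Array[String]): Unit = {
    val elapsed = Duration.ofDays(40) // sellTimestamp - deliveredTimestamp
    val limit   = Duration.ofDays(45) // "1 month 15 days" as a fixed length
    // Duration implements Comparable, so the WHERE clause is well defined:
    println(elapsed.compareTo(limit) < 0) // true
    // java.time.Period has no such ordering: is "1 month" less than
    // "30 days"? It depends on the month, so PERIOD values cannot be ordered.
  }
}
```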