spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] MaxGekk commented on a change in pull request #25981: [SPARK-28420][SQL] Support the `INTERVAL` type in `date_part()`
Date Tue, 08 Oct 2019 15:18:00 GMT
MaxGekk commented on a change in pull request #25981: [SPARK-28420][SQL] Support the `INTERVAL` type in `date_part()`
URL: https://github.com/apache/spark/pull/25981#discussion_r332571726
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ##########
 @@ -2067,6 +2082,10 @@ object DatePart {
        224
       > SELECT _FUNC_('SECONDS', timestamp'2019-10-01 00:00:01.000001');
        1.000001
+      > SELECT _FUNC_('days', interval 1 year 10 months 5 days);
 
 Review comment:
   > Sometimes the entire interval's length is returned in the given unit ...
   
   The entire interval's length is returned only for `EPOCH`; for all other `field` values, the corresponding component of the interval is returned.
   
   > I wonder if there is any other standard to look at.
   
   I like the Java 8 approach, where there are 2 types: `java.time.Duration` and `java.time.Period`. The first one stores the interval duration with nanosecond precision; the second one stores the components `years`, `months` and `days`. I would reuse this model and extend the types slightly:
   - A `DURATION` type to store the interval duration in microseconds. A `long` should be enough to store the difference between any supported timestamps.
   - A `PERIOD` type to store `years`, `months`, `days`, `hours`, `minutes`, `seconds`, `milliseconds` and `microseconds`. For example, (-10 years, 5 months, -3 hours, 100 microseconds).
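   
   For illustration, a small Scala sketch of that Java 8 model (standard `java.time` only; the concrete values are made up):
   ```scala
   import java.time.{Duration, Period}
   
   // DURATION-like: one exact length of time; microsecond precision fits in a Long.
   val d = Duration.ofDays(3).plusHours(7).plusMillis(123)
   val micros: Long = d.toNanos / 1000      // total length in microseconds
   
   // PERIOD-like: separate calendar components. java.time.Period keeps only
   // years/months/days; the proposed PERIOD would also keep the time components.
   val p = Period.of(-10, 5, 0)             // (-10 years, 5 months, 0 days)
   println(s"years=${p.getYears}, months=${p.getMonths}, days=${p.getDays}")
   ```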
   
   > I'm mostly concerned with having consistent semantics, whatever they are. If it can't be done reasonably, hm, should we implement this?
   
   I don't know. From my point of view, the semantics of each extraction field are well defined. There is a difference between the implementation in this PR and PostgreSQL's. As I wrote above, Spark stores an interval in 2 variables, `months` and `microseconds`, but PostgreSQL uses 3 vars: `month`, `day` and `time` (in microseconds). In this way, `days` are independent of the other components, and as a consequence `hours` are not limited to `[0, 24)` (see also the sketch after the PostgreSQL example below):
   ```sql
   maxim=# SELECT interval '-100 years 100 days -100 hours'; 
               interval             
   ---------------------------------
    -100 years +100 days -100:00:00
   (1 row)
   ```
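   
   To make the difference concrete, here is a rough Scala sketch (with made-up numbers) of what happens when an interval such as `100 days -100 hours` is folded into a single microseconds value, as a two-variable (`months`, `microseconds`) representation has to do:
   ```scala
   val MICROS_PER_HOUR = 3600L * 1000000L
   val MICROS_PER_DAY  = 24L * MICROS_PER_HOUR
   
   // "100 days -100 hours" collapses into one number of microseconds ...
   val micros = 100L * MICROS_PER_DAY - 100L * MICROS_PER_HOUR
   
   // ... so extracting the components back is entangled:
   val days  = micros / MICROS_PER_DAY                      // 95, not 100
   val hours = (micros % MICROS_PER_DAY) / MICROS_PER_HOUR  // 20, not -100
   ```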
   
   > I'm trying to figure out the use case for "a date part of an interval". I can only think of cases where the interval should be converted entirely into some unit.
   
   I can imagine at least 2 use cases (sketched after this list):
   - Adjusting some timestamp/date components. For example, we have `timestamp '2019-10-08 10:11:12.123456'`, and we want to adjust it by `-1 year 10 hours`. As a result, we get `timestamp '2018-10-08 20:11:12.123456'`. The `PERIOD` type could be useful here.
   - When we need `Ordered` (and comparable) intervals, so that we can calculate the absolute interval duration in some unit. For example, in the query:
   ```sql
   SELECT name, place
   FROM t1, t2
   WHERE t1.sellTimestamp - t2.deliveredTimestamp < interval 1 month 15 days;
   ```
   Here, the `DURATION` type should be used. We cannot use `PERIOD` because its values cannot be ordered, and Spark's `CalendarIntervalType` cannot be used here either.
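   
   Both use cases can be sketched with plain `java.time` (the timestamps, and the 45-day stand-in for `interval 1 month 15 days`, are made up for illustration):
   ```scala
   import java.time.{Duration, LocalDateTime, Period}
   
   // Use case 1: adjust a timestamp by "-1 year 10 hours" (PERIOD-like).
   val ts = LocalDateTime.parse("2019-10-08T10:11:12.123456")
   val adjusted = ts.minus(Period.ofYears(1)).plusHours(10)
   // adjusted == 2018-10-08T20:11:12.123456
   
   // Use case 2: compare the distance between two timestamps with a threshold
   // (DURATION-like). "1 month 15 days" is approximated as 45 days only to keep
   // the sketch ordered; that approximation is exactly what a PERIOD cannot give us.
   val sell      = LocalDateTime.parse("2019-10-08T10:00:00")
   val delivered = LocalDateTime.parse("2019-09-01T00:00:00")
   val ok = Duration.between(delivered, sell).compareTo(Duration.ofDays(45)) < 0
   // ok == true (about 37 days < 45 days)
   ```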

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

