spark-issues mailing list archives

From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
Date Mon, 13 Jun 2016 03:04:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326762#comment-15326762 ]

Josh Rosen edited comment on SPARK-11415 at 6/13/16 3:03 AM:
-------------------------------------------------------------

On the other hand, it seems clear that for a single JVM there should only be one canonical
{{Date}} for a given date and presumably {{Date.valueOf}} is that date.

According to the Javadocs for java.sql.Date:

{quote}
A thin wrapper around a millisecond value that allows JDBC to identify this as an SQL DATE
value. A milliseconds value represents the number of milliseconds that have passed since January
1, 1970 00:00:00.000 GMT.

To conform with the definition of SQL DATE, the millisecond values wrapped by a java.sql.Date
instance must be 'normalized' by setting the hours, minutes, seconds, and milliseconds to
zero in the particular time zone with which the instance is associated.
{quote}

I guess that one interpretation of this is that we need to truncate timestamps to the time-zone-offset-shifted
midnight of that day.
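
As a rough illustration of that reading, the sketch below checks where {{Date.valueOf}} places the millisecond value for different default time zones. It is not taken from Spark: the object name and both helpers are made up for the example, and it assumes a JVM where {{TimeZone.setDefault}} is permitted.

```scala
import java.sql.Date
import java.util.TimeZone

object DateValueOfDemo {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Hypothetical helper: runs `body` with the JVM default time zone
  // temporarily set to `zoneId`, then restores the previous default.
  def withDefaultZone[A](zoneId: String)(body: => A): A = {
    val saved = TimeZone.getDefault
    TimeZone.setDefault(TimeZone.getTimeZone(zoneId))
    try body finally TimeZone.setDefault(saved)
  }

  // Hypothetical helper: how far past UTC midnight the millisecond value
  // of Date.valueOf(isoDate) falls when `zoneId` is the default zone.
  def millisPastUtcMidnight(zoneId: String, isoDate: String): Long =
    withDefaultZone(zoneId) {
      Math.floorMod(Date.valueOf(isoDate).getTime, MillisPerDay)
    }
}
```

Under these assumptions, {{millisPastUtcMidnight("UTC", "2016-06-13")}} is 0, while {{"America/Los_Angeles"}} gives 7 hours (25200000 ms), because local midnight there falls at 07:00 UTC on that date. So the value wrapped by {{Date.valueOf}} is a whole number of UTC days only when the default zone has a zero offset.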


was (Author: joshrosen):
On the other hand, it seems clear that for a single JVM there should only be one canonical
{{Date}} for a given date and presumably {{Date.valueOf}} is that date.

According to the Javadocs for java.sql.Date:

{quote}
A thin wrapper around a millisecond value that allows JDBC to identify this as an SQL DATE
value. A milliseconds value represents the number of milliseconds that have passed since January
1, 1970 00:00:00.000 GMT.

To conform with the definition of SQL DATE, the millisecond values wrapped by a java.sql.Date
instance must be 'normalized' by setting the hours, minutes, seconds, and milliseconds to
zero in the particular time zone with which the instance is associated.
{code}

I guess that one interpretation of this is that we need to truncate timestamps to the time-zone-offset-shifted
midnight of that day.

> Catalyst DateType Shifts Input Data by Local Timezone
> -----------------------------------------------------
>
>                 Key: SPARK-11415
>                 URL: https://issues.apache.org/jira/browse/SPARK-11415
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.5.0, 1.5.1
>            Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't get a consistent
> result for java.sql.Date. I investigated and noticed the following code is used to create
> Catalyst.DateTypes:
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>    * Returns the number of days since epoch from java.sql.Date.
>    */
>   def fromJavaDate(date: Date): SQLDate = {
>     millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying timestamp to
> the local timezone before calculating the days from epoch. This causes the invocation to move
> the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> day
>   def millisToDays(millisUtc: Long): SQLDate = {
>     // SPARK-6785: use Math.floor so negative number of days (dates before 1970)
>     // will correctly work as input for function toJavaDate(Int)
>     val millisLocal = millisUtc + threadLocalLocalTimeZone.get().getOffset(millisUtc)
>     Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts by the timezone offset:
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
>     val millisUtc = days.toLong * MILLIS_PER_DAY
>     millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause off-by-one errors and could cause significant shifts in data if the underlying
> data is processed in time zones other than UTC.
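
To make the reported shift concrete, here is a minimal sketch that restates the two quoted helpers with the time zone passed in explicitly. The explicit {{tz}} parameter and the object name are assumptions made for illustration; the Spark versions quoted above read {{threadLocalLocalTimeZone}} instead.

```scala
import java.util.TimeZone

object DateShiftSketch {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Same arithmetic as the quoted millisToDays, with the zone as a
  // parameter (hypothetical signature): shift by the zone offset first,
  // then floor to whole days so pre-1970 values round downward.
  def millisToDays(millisUtc: Long, tz: TimeZone): Int = {
    val millisLocal = millisUtc + tz.getOffset(millisUtc)
    Math.floor(millisLocal.toDouble / MillisPerDay).toInt
  }

  // Same arithmetic as the quoted daysToMillis: start from UTC midnight
  // of the given day and subtract the zone offset at that instant.
  def daysToMillis(days: Int, tz: TimeZone): Long = {
    val millisUtc = days.toLong * MillisPerDay
    millisUtc - tz.getOffset(millisUtc)
  }
}
```

With {{TimeZone.getTimeZone("America/Los_Angeles")}} as {{tz}}, {{millisToDays(16965L * MillisPerDay, tz)}} returns 16964: the UTC-midnight instant of day 16965 (2016-06-13) lands on the previous local day. In the other direction, {{daysToMillis(16965, tz)}} returns a value 7 hours after UTC midnight, which is local midnight in that zone, and feeding it back through {{millisToDays}} with the same zone gives 16965 again. Whether the shift looks like a bug therefore depends on whether the incoming {{Date}} wraps UTC midnight or local midnight.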


