spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Takuya Ueshin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24752) date_format provides incorrect year after a timezone conversation changes the year on a timestamp
Date Mon, 09 Jul 2018 02:52:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536502#comment-16536502
] 

Takuya Ueshin commented on SPARK-24752:
---------------------------------------

Maybe we should use {{"yyyy-MM-dd"}} for the format string instead of {{"YYYY-MM-dd"}}?

{code}
scala> val result = central_ts.withColumn("observation_date", date_format($"central_timestamp",
"yyyy-MM-dd")).withColumn("observation_time", date_format($"central_timestamp", "HH:mm"))
scala> result.select("observation_date","observation_time","central_timestamp","UTC").show()
+----------------+----------------+-------------------+----------------+
|observation_date|observation_time|  central_timestamp|             UTC|
+----------------+----------------+-------------------+----------------+
|      2015-12-31|           20:22|2015-12-31 20:22:00|2016-01-01 02:22|
|      2016-01-01|           02:22|2016-01-01 02:22:00|2016-01-01 08:22|
+----------------+----------------+-------------------+----------------+
{code}

https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html
https://docs.oracle.com/javase/8/docs/api/java/util/GregorianCalendar.html#week_year

> date_format provides incorrect year after a timezone conversation changes the year on
a timestamp
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24752
>                 URL: https://issues.apache.org/jira/browse/SPARK-24752
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core, SQL
>    Affects Versions: 2.1.2, 2.4.0
>            Reporter: Noah Lidell
>            Priority: Minor
>
> Initially I have a dataframe with a column for date and a column for time in UTC time.
I combine these two columns into a single string that represents a UTC timestamp. I then convert
that UTC timestamp into a Central Time timestamp using pyspark.sql.functions.from_utc_timestamp().
In the example below I have timestamps that are early in the day on Janurary 1st UTC, so that when
they are converted to Central Time the resulting timestamp is on December 31st of the previous
year. These timestamps are produced correctly. However, if I then use pyspark.sql.functions.date_format()
to format a date from the one of these December 31st Central Time timestamps then the year
portion of the output will be wrong. 
>  
>  
> {code:java}
> import pyspark.sql.functions as F
> columns = ["observation_date","observation_time"]
> vals = [
> ('2015-01-01', '02:22'),
> ('2016-01-01', '02:22'),
> ('2016-01-01', '08:22'),
> ('2016-01-02', '02:22'),
> ('2017-01-01', '02:22'),
> ('2018-01-01', '02:22'),
> ('2019-01-01', '02:22'),
> ]
> test_df = spark.createDataFrame(vals, columns)
> test_df.show()
> # OUTPUT
> # +----------------+----------------+
> # |observation_date|observation_time|
> # +----------------+----------------+
> # |      2015-01-01|           02:22|
> # |      2016-01-01|           02:22|
> # |      2016-01-01|           08:22|
> # |      2016-01-02|           02:22|
> # |      2017-01-01|           02:22|
> # |      2018-01-01|           02:22|
> # |      2019-01-01|           02:22|
> # +----------------+----------------+
> renamed = test_df.withColumnRenamed("observation_date","UTC_observation_date").withColumnRenamed("observation_time","UTC_observation_time")
> utc_ts = renamed.withColumn("UTC", F.concat(F.col("UTC_observation_date"), F.lit(" "),
F.col("UTC_observation_time")))
> utc_ts.show()
> # OUTPUT
> # +--------------------+--------------------+----------------+
> # |UTC_observation_date|UTC_observation_time|             UTC|
> # +--------------------+--------------------+----------------+
> # |          2015-01-01|               02:22|2015-01-01 02:22|
> # |          2016-01-01|               02:22|2016-01-01 02:22|
> # |          2016-01-01|               08:22|2016-01-01 08:22|
> # |          2016-01-02|               02:22|2016-01-02 02:22|
> # |          2017-01-01|               02:22|2017-01-01 02:22|
> # |          2018-01-01|               02:22|2018-01-01 02:22|
> # |          2019-01-01|               02:22|2019-01-01 02:22|
> # +--------------------+--------------------+----------------+
> central_ts = utc_ts.withColumn("central_timestamp", F.from_utc_timestamp(utc_ts.UTC,
'US/Central'))
> final = central_ts.withColumn('observation_date', F.date_format(central_ts.central_timestamp,
"YYYY-MM-dd"))\
> .withColumn('observation_time', F.date_format(central_ts.central_timestamp, "HH:mm"))
> final.select('observation_date','observation_time','central_timestamp','UTC').show()
> # OUTPUT
> # +----------------+----------------+--------------------+----------------+
> # |observation_date|observation_time|   central_timestamp|             UTC|
> # +----------------+----------------+--------------------+----------------+
> # |      2015-12-31|           20:22|2014-12-31 20:22:...|2015-01-01 02:22|
> # |      2016-12-31|           20:22|2015-12-31 20:22:...|2016-01-01 02:22|
> # |      2016-01-01|           02:22|2016-01-01 02:22:...|2016-01-01 08:22|
> # |      2016-01-01|           20:22|2016-01-01 20:22:...|2016-01-02 02:22|
> # |      2016-12-31|           20:22|2016-12-31 20:22:...|2017-01-01 02:22|
> # |      2018-12-31|           20:22|2017-12-31 20:22:...|2018-01-01 02:22|
> # |      2019-12-31|           20:22|2018-12-31 20:22:...|2019-01-01 02:22|
> # +----------------+----------------+--------------------+----------------+{code}
> The Central Time timestamp is generated correctly by the from_utc_timestamp function
but when formatted by the date_format function _and if_ the timezone conversation caused a
change in year then the formatted string produced by the date_format function will have the
wrong year.
> In summary, this is the incorrect behavior:
> {code:java}
> UTC timestamp="2016-01-01 02:22" 
> --(apply from_utc_timestamp)--> Central timestamp="2015-12-31 20:22" 
> --(apply date_format "YYYY-MM-dd")--> date="2016-12-31"{code}
> output instead should be "2015-12-31"
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message