spark-issues mailing list archives

From "Bryan Cutler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-21375) Add date and timestamp support to ArrowConverters for toPandas() collection
Date Tue, 25 Jul 2017 17:49:01 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100434#comment-16100434 ]

Bryan Cutler commented on SPARK-21375:
--------------------------------------

Thanks for the details [~wesmckinn].  The approach that Arrow uses makes sense to me, but
as far as I know there is no way for Spark to create time zone naive timestamps.  Please correct
me if I'm wrong [~cloud_fan] [~ueshin].  When creating a {{Dataset}} with a {{TimestampType}}
that does not specify a time zone, Spark will always assume it is from {{DateTimeUtils.defaultTimeZone()}},
which corresponds to the system time zone.  In the PR for this we are discussing which time zone
to use in the Arrow data.  The options are:

1. Force "UTC"
Spark SQL stores a timestamp value internally as the number of microseconds since 1970-01-01 00:00:00.0 UTC.

2. {{SQLConf.SESSION_LOCAL_TIMEZONE}}
Spark SQL uses this time zone to represent timestamps and to perform time zone related operations. If
the config value is not set, it falls back to {{DateTimeUtils.defaultTimeZone()}}.

3. {{DateTimeUtils.defaultTimeZone()}}
The system timezone.
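
To illustrate why the choice matters, here is a minimal Python sketch, not Spark or Arrow code. It assumes only what is stated above: the internal value is microseconds since the epoch in UTC. The helper {{micros_to_datetime}} and the two fixed-offset zones are hypothetical stand-ins for the session and system time zones. The same stored value is the same instant under all three options; only the wall-clock rendering differs.

```python
from datetime import datetime, timedelta, timezone

# Internal representation described above:
# microseconds since 1970-01-01 00:00:00.0 UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_datetime(micros, tz):
    """Interpret an epoch-microsecond value as an aware datetime in tz.

    Hypothetical helper for illustration; not part of Spark or Arrow.
    """
    return (EPOCH + timedelta(microseconds=micros)).astimezone(tz)


# Stand-ins for the three options. The offsets are made up for the example.
utc = timezone.utc                                      # option 1
session_tz = timezone(timedelta(hours=-8), "session")   # option 2 (assumed value)
system_tz = timezone(timedelta(hours=9), "system")      # option 3 (assumed value)

micros = 1_500_000_000_000_000

as_utc = micros_to_datetime(micros, utc)
as_session = micros_to_datetime(micros, session_tz)
as_system = micros_to_datetime(micros, system_tz)

for label, value in [("UTC", as_utc), ("session", as_session), ("system", as_system)]:
    print(label, value.isoformat())
```

Dropping the time zone from any of these values (for example with {{replace(tzinfo=None)}}) gives a naive timestamp whose meaning depends on which option produced it, which is the ambiguity being discussed.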

> Add date and timestamp support to ArrowConverters for toPandas() collection
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-21375
>                 URL: https://issues.apache.org/jira/browse/SPARK-21375
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>
> Date and timestamp are not yet supported in DataFrame.toPandas() using ArrowConverters.
> These are common types for data analysis used in both Spark and Pandas and should be supported.
> There is a discrepancy in the way that PySpark and Arrow internally store timestamps that have
> no time zone specified.  PySpark takes a UTC timestamp that is adjusted to local time
> and Arrow is in UTC time.  Hopefully there is a clean way to resolve this.
> Spark internal storage spec:
> * *DateType* stored as days
> * *Timestamp* stored as microseconds 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


