spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe
Date Tue, 06 Feb 2018 07:49:01 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16353519#comment-16353519 ]

Apache Spark commented on SPARK-23290:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20515

> inadvertent change in handling of DateType when converting to pandas dataframe
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-23290
>                 URL: https://issues.apache.org/jira/browse/SPARK-23290
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Andre Menck
>            Priority: Blocker
>
> In [this PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
> there was a change in how `DateType` values are returned to users (line 1968 in dataframe.py).
> This can cause client code to fail, as in the following example from a Python terminal:
> {code:python}
> >>> import datetime as dt
> >>> import pandas as pd
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 0    2015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> date    datetime64[ns]
> num              int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", line 2355, in apply
>     mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer
>   File "<stdin>", line 1, in <lambda>
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> The example above shows both the old behavior (returning an "object" column) and the new
> behavior (returning a datetime column). Since there may be user code relying on the old
> behavior, I'd suggest reverting this specific part of the change. Also note that the NOTE in
> the docstring for "_to_corrected_pandas_type" seems to be off: it refers to the old behavior,
> not the current one.
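
For client code that has to run against both behaviors, one option is to normalize each value before parsing it. The following is a minimal sketch, not taken from the Spark or pandas codebases: `to_date` is a hypothetical helper name, and the sketch assumes date strings in the `%Y-%m-%d` format used in the example above.

```python
import datetime as dt


def to_date(value, fmt="%Y-%m-%d"):
    """Return a datetime.date for a string, date, or datetime-like value.

    Hypothetical helper for illustration; not part of Spark or pandas.
    """
    # Check datetime before date: datetime.datetime is a subclass of
    # datetime.date, so the order of these two checks matters.
    if isinstance(value, dt.datetime):
        return value.date()
    if isinstance(value, dt.date):
        return value
    if isinstance(value, str):
        return dt.datetime.strptime(value, fmt).date()
    raise TypeError("unsupported value type: %r" % type(value))


# The same helper accepts the string form and the datetime form.
print(to_date("2015-01-01"))
print(to_date(dt.datetime(2015, 1, 1, 12, 30)))
print(to_date(dt.date(2015, 1, 1)))
```

Because `pandas.Timestamp` is a subclass of `datetime.datetime`, a call such as `pdf['date'].apply(to_date)` should work on both the object column and the datetime64 column from the transcript above. Missing values are not handled in this sketch.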



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
