spark-issues mailing list archives

From "Andre Menck (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe
Date Fri, 02 Feb 2018 20:29:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350917#comment-16350917 ]

Andre Menck commented on SPARK-23290:
-------------------------------------

Hey [~ueshin] apologies, I tried to come up with a simpler example of the failure I saw and
ended up with an incorrect one! Here is a more straightforward example of the failure in
2.3, caused specifically by joining on columns with different (but similar) types:
{code}
>>> pdf = df.toPandas()
>>> pdf.dtypes
date    datetime64[ns]
num              int64
dtype: object
>>> type(pdf['date'][0])
<class 'pandas._libs.tslib.Timestamp'>
>>> user_provided_pdf
         date  num
0  2015-01-01    1
>>> user_provided_pdf.dtypes
date    object
num      int64
dtype: object
>>> type(user_provided_pdf['date'][0])
<type 'datetime.date'>
{code}
At this point, a simple example of the behavior change is an equality check:
{code}
>>> pdf.loc[0,'date'] == user_provided_pdf.loc[0,'date']
False
{code}
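The equality failure can be reproduced without Spark at all; a minimal sketch in plain pandas, where `pdf` mimics what 2.3's {{toPandas()}} returns (a datetime64[ns] column) and `user_pdf` mimics a frame built from `datetime.date` objects (the frame names and contents here are illustrative, not from Spark):

```python
import datetime as dt
import pandas as pd

# Mimic Spark 2.3's toPandas() output: DateType arrives as datetime64[ns],
# so individual elements are pandas Timestamp objects.
pdf = pd.DataFrame({'date': pd.to_datetime(['2015-01-01']), 'num': [1]})

# Mimic a user-provided frame holding plain datetime.date objects
# (object dtype), which is what Spark 2.2 used to return.
user_pdf = pd.DataFrame({'date': [dt.date(2015, 1, 1)], 'num': [1]})

# A Timestamp never compares equal to a datetime.date, even for the
# same calendar day.
print(type(pdf['date'][0]))       # pandas Timestamp
print(type(user_pdf['date'][0]))  # datetime.date
print(pdf.loc[0, 'date'] == user_pdf.loc[0, 'date'])
```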
In reality, I hit this when executing a join with a pandas dataframe obtained from another
source:
{code}
>>> pdf.merge(user_provided_pdf, on=['date'], how='inner')
Empty DataFrame
Columns: [date, num_x, num_y]
Index: []
{code}
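One client-side workaround is to normalize both key columns to the same dtype before joining; a sketch in plain pandas, using illustrative frames shaped like the ones above (this is an assumption about the user's data, not a Spark API):

```python
import datetime as dt
import pandas as pd

# pdf: shaped like Spark 2.3's toPandas() output (datetime64[ns] column).
pdf = pd.DataFrame({'date': pd.to_datetime(['2015-01-01']), 'num': [1]})
# user_pdf: dates held as plain datetime.date objects (object dtype).
user_pdf = pd.DataFrame({'date': [dt.date(2015, 1, 1)], 'num': [1]})

# Coerce the object column to datetime64[ns] so both keys share a dtype;
# the inner join then matches rows as it did under 2.2.
user_pdf['date'] = pd.to_datetime(user_pdf['date'])
merged = pdf.merge(user_pdf, on=['date'], how='inner')
print(merged)
```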
Under 2.2, the equality above holds and this join produces a non-empty result.

> inadvertent change in handling of DateType when converting to pandas dataframe
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-23290
>                 URL: https://issues.apache.org/jira/browse/SPARK-23290
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Andre Menck
>            Priority: Blocker
>
> In [this PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968] there was a change in how `DateType` is returned to users (line 1968 in dataframe.py). This can cause client code to fail, as in the following example from a Python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 0    2015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> date    datetime64[ns]
> num              int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", line 2355, in apply
>     mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer
>   File "<stdin>", line 1, in <lambda>
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> Above we show both the old behavior (returning an "object" column) and the new behavior (returning a datetime64 column). Since user code may rely on the old behavior, I'd suggest reverting this specific part of the change. Also note that the NOTE in the docstring for "_to_corrected_pandas_type" appears to be out of date, describing the old behavior rather than the current one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

