spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name
Date Tue, 10 Apr 2018 01:28:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431584#comment-16431584
] 

Hyukjin Kwon commented on SPARK-23929:
--------------------------------------

I think we already use positional-based approach and released out ..

> pandas_udf schema mapped by position and not by name
> ----------------------------------------------------
>
>                 Key: SPARK-23929
>                 URL: https://issues.apache.org/jira/browse/SPARK-23929
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: PySpark
> Spark 2.3.0
>  
>            Reporter: Omri
>            Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by name. Currently
it's not the case.
> Consider these two examples, where the only change is the order of the fields in the
provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+-------------------+
> | id|                  v|
> +---+-------------------+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+-------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message