spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wei Zheng (JIRA)" <>
Subject [jira] [Commented] (SPARK-13301) PySpark Dataframe return wrong results with custom UDF
Date Mon, 25 Sep 2017 22:40:01 GMT


Wei Zheng commented on SPARK-13301:

I recently came across a similar wrong result issue with Spark SQL running Hive's UDF. Just
want to put my 2 cents here: UDF execution in Hive is guaranteed to be single-threaded, so
thread safety is not an obvious issue when using Hive. But in Spark's case where multiple
tasks can run within a single JVM, if the UDF is not properly handling thread safety, wrong
result may appear.

> PySpark Dataframe return wrong results with custom UDF
> ------------------------------------------------------
>                 Key: SPARK-13301
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>         Environment: PySpark in yarn-client mode - CDH 5.5.1
>            Reporter: Simone
>            Priority: Critical
> Using a User Defined Function in PySpark inside the withColumn() method of Dataframe,
gives wrong results.
> Here an example:
> from pyspark.sql import functions
> import string
> myFunc = functions.udf(lambda s: string.lower(s))
>"col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()
> |                col1|       col2|                col3|
> |1265AB4F65C05740E...|        Ivo|4f00ae514e7c015be...|
> |1D94AB4F75C83B51E...|   Raffaele|4f00dcf6422100c0e...|
> |4F008903600A0133E...|   Cristina|4f008903600a0133e...|
> The results are wrong and seem to be random: some record are OK (for example the third)
some others NO (for example the first 2).
> The problem seems not occur with Spark built-in functions:
> from pyspark.sql.functions import *
>"col1", "col2").withColumn("col3", lower(myDF["col1"])).show()
> Without the withColumn() method, results seems to be always correct:
>"col1", "col2", myFunc(myDF["col1"])).show()
> This can be considered only in part a workaround because you have to list each time all
column of your Dataframe.
> Also in Scala/Java the problems seems not occur.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message