spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Abdeali Kothari (JIRA)" <>
Subject [jira] [Created] (SPARK-26067) Pandas GROUPED_MAP udf breaks if DF has >255 columns
Date Thu, 15 Nov 2018 03:44:00 GMT
Abdeali Kothari created SPARK-26067:

             Summary: Pandas GROUPED_MAP udf breaks if DF has >255 columns
                 Key: SPARK-26067
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.0, 2.3.2
            Reporter: Abdeali Kothari

When I run spark's Pandas GROUPED_MAP udfs to apply a UDAF i wrote in pythohn/pandas on a
grouped dataframe in spark - it fails if the number of columns is greater than 255 in Pytohn
3.6 and lower.

import pyspark
from pyspark.sql import types as T, functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [[i for i in range(256)], [i+1 for i in range(256)]], schema=["a" + str(i) for i in range(256)])

new_schema = T.StructType([
    field for field in df.schema] + [T.StructField("new_row", T.DoubleType())])

def myfunc(df):
    df['new_row'] = 1
    return df

myfunc_udf = F.pandas_udf(new_schema, F.PandasUDFType.GROUPED_MAP)(myfunc)

df2 = df.groupBy(["a1"]).apply(myfunc_udf)

print(df2.count())  # This FAILS
# Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
#   File "/usr/local/hadoop/spark2.3.1/python/lib/", line 219,
in main
#     func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
#   File "/usr/local/hadoop/spark2.3.1/python/lib/", line 148,
in read_udfs
#     mapper = eval(mapper_str, udfs)
#   File "<string>", line 1
# SyntaxError: more than 255 arguments

I believe thhis is happening because internally this creates a UDF with inputs as every column
in the DF.

Note: In Python 3.7 the 255 limit was raised, but I have not tried with Pytohn 3.7 ...

I was using Python 3.5 (from anaconda), Spark 2.3.1 to reproduce thihs on my Hadoop Linux
cluster and also on my Mac standalone spark installation.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message