spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-23009) PySpark should not assume Pandas cols are a basestring type
Date Tue, 09 Jan 2018 20:04:02 GMT

     [ https://issues.apache.org/jira/browse/SPARK-23009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23009:
------------------------------------

    Assignee: Apache Spark

> PySpark should not assume Pandas cols are a basestring type
> -----------------------------------------------------------
>
>                 Key: SPARK-23009
>                 URL: https://issues.apache.org/jira/browse/SPARK-23009
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>            Assignee: Apache Spark
>
> When calling {{SparkSession.createDataFrame}} with a Pandas DataFrame as input, Spark
> assumes that the column labels are either {{str}} or {{unicode}}.  They can actually
> be any type that a dict can key off of.  If they are not a {{basestring}} type, a confusing
> AttributeError is thrown:
> {noformat}
> In [16]: pdf = pd.DataFrame(np.random.rand(4, 2))
> In [17]: pdf
> Out[17]: 
>           0         1
> 0  0.145171  0.482940
> 1  0.151336  0.299861
> 2  0.220338  0.830133
> 3  0.001659  0.513787
> In [18]: pdf.columns
> Out[18]: RangeIndex(start=0, stop=2, step=1)
> In [19]: df = spark.createDataFrame(pdf)
> ---------------------------------------------------------------------------
> AttributeError                            Traceback (most recent call last)
> <ipython-input-18-11bcb07e0e39> in <module>()
> ----> 1 df = spark.createDataFrame(pdf)
> /home/bryan/git/spark/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
>     646             # If no schema supplied by user then get the names of columns only
>     647             if schema is None:
> --> 648                 schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in data.columns]
>     649 
>     650             if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \
> AttributeError: 'int' object has no attribute 'encode'
> {noformat}
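
The traceback shows {{x.encode('utf-8')}} being called on an integer label from the default RangeIndex. As a rough illustration only, and not necessarily the change that resolves this issue, column labels could be coerced to strings before they are used as a schema. The helper name {{normalize_column_names}} below is hypothetical and is not part of PySpark; the sketch assumes Python 3, where {{str}} is the text type.

```python
def normalize_column_names(columns):
    """Return column labels as text strings.

    Hypothetical helper for illustration; not part of PySpark.
    Assumes Python 3 (str is text, bytes is binary).
    """
    names = []
    for label in columns:
        if isinstance(label, bytes):
            # Byte-string labels are decoded rather than str()-ed,
            # which would otherwise yield "b'...'".
            names.append(label.decode("utf-8"))
        elif isinstance(label, str):
            names.append(label)
        else:
            # Any other hashable label (int, float, tuple, ...) is
            # converted with str() instead of calling .encode() on it.
            names.append(str(label))
    return names


# Integer labels, as produced by a default RangeIndex of two columns.
print(normalize_column_names(range(2)))  # prints ['0', '1']
```

With a Pandas DataFrame, the same helper could be applied to {{pdf.columns}} and the result passed as the {{schema}} argument of {{createDataFrame}}, which sidesteps the failing code path in the meantime.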



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
