spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-20791) Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame
Date Mon, 13 Nov 2017 04:17:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-20791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hyukjin Kwon resolved SPARK-20791.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 19459
[https://github.com/apache/spark/pull/19459]

> Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20791
>                 URL: https://issues.apache.org/jira/browse/SPARK-20791
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.1.1
>            Reporter: Bryan Cutler
>             Fix For: 2.3.0
>
>
> The current code for creating a Spark DataFrame from a Pandas DataFrame uses `to_records`
to convert the DataFrame to a list of records and then converts each record to a list.  Following
this, there are a number of calls to serialize and transfer this data to the JVM.  This process
is very inefficient and also discards all schema metadata, requiring another pass over the
data to infer types.
> Using Apache Arrow, the Pandas DataFrame could be efficiently converted to Arrow data
and directly transferred to the JVM to create the Spark DataFrame.  The performance will be
better and the Pandas schema will also be used so that the correct types will be used.  
> Issues with the poor type inference have come up before, causing confusion and frustration
with users because it is not clear why it fails or doesn't use the same type from Pandas.
 Fixing this with Apache Arrow will solve another pain point for Python users and the following
JIRAs could be closed:
> * SPARK-17804
> * SPARK-18178



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message