spark-issues mailing list archives

From "Bryan Cutler (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
Date Fri, 30 Jun 2017 18:06:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070475#comment-16070475 ]

Bryan Cutler edited comment on SPARK-13534 at 6/30/17 6:05 PM:
---------------------------------------------------------------

Hi [~jaiseban@gmail.com], the DataFrameWriter API is for persisting to disk, which is not the
intent for Arrow since it is an in-memory format. It would be possible in the future to add
an API that exposes the internal data of a Spark Dataset as Arrow data that could be consumed
by another process.
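
For context, the Arrow-backed {{toPandas()}} path that eventually shipped (see Fix Version
2.3.0 below) is opt-in through a SQL conf. A minimal sketch, assuming Spark 2.3+ with
pyarrow installed on the driver; the conf name is taken from the Spark 2.3 documentation:

{code:python}
from pyspark.sql import SparkSession

# Assumes Spark 2.3+ with pyarrow available on the driver
spark = SparkSession.builder.appName("arrow-topandas").getOrCreate()

# Opt in to the Arrow serializer for DataFrame.toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(1 << 20).selectExpr("id", "id * 2 AS doubled")
pdf = df.toPandas()  # columns travel as Arrow record batches, not pickled Rows
print(pdf.head())
{code}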


> Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-13534
>                 URL: https://issues.apache.org/jira/browse/SPARK-13534
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: Wes McKinney
>            Assignee: Bryan Cutler
>             Fix For: 2.3.0
>
>         Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using PySpark passes
through an inefficient serialization-deserialization process that I've examined at a high level
here: https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] objects are being
deserialized in pure Python as a list of tuples, which are then converted to pandas.DataFrame
using its {{from_records}} alternate constructor. This also uses a large amount of memory (see
the first sketch after this description).
> For flat (no nested types) schemas, the Apache Arrow memory layout (https://github.com/apache/arrow/tree/master/format)
can be deserialized to {{pandas.DataFrame}} objects with small overhead relative to memcpy /
system memory bandwidth -- Arrow's bitmasks must be examined, replacing the
corresponding null values with pandas's sentinel values (None or NaN as appropriate; see the
second sketch below).
> I will be contributing patches to Arrow in the coming weeks for converting between Arrow
and pandas in the general case, so if Spark can send Arrow memory to PySpark, we will hopefully
be able to increase the Python data access throughput by an order of magnitude or more. I
propose to add a new serializer for Spark DataFrame and a new method that can be invoked
from PySpark to request an Arrow memory-layout byte stream, prefixed by a data header indicating
array buffer offsets and sizes (see the third sketch below).
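
For illustration, the pure-Python path described in the first paragraph amounts to roughly
the following; the hardcoded tuples stand in for collected {{Row}} objects:

{code:python}
import pandas as pd

# Rows arrive as a list of Python tuples after pickle deserialization
rows = [(1, "a", 0.5), (2, "b", 1.5), (3, None, 2.5)]

# Every cell is boxed as a Python object before the columnar conversion,
# which drives up both CPU time and memory
pdf = pd.DataFrame.from_records(rows, columns=["id", "label", "value"])
print(pdf.dtypes)
{code}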
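
The null handling in the second paragraph can be seen with today's pyarrow API; a small
sketch (note the null slot in the int64 column surfaces as NaN, promoting the column to
float64):

{code:python}
import pyarrow as pa

# Nulls are tracked in Arrow's validity bitmaps, one per column
table = pa.table({
    "id": pa.array([1, 2, None], type=pa.int64()),
    "value": pa.array([0.5, None, 2.5], type=pa.float64()),
})

# to_pandas() walks the bitmaps and substitutes pandas sentinels (None/NaN)
pdf = table.to_pandas()
print(pdf)
{code}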
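
Finally, a minimal sketch of the proposed byte stream using pyarrow's IPC APIs -- the stream
framing is what carries the buffer offsets and sizes mentioned above; the names and data here
are illustrative only:

{code:python}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array([0.5, 1.5, 2.5])],
    names=["id", "value"],
)

# Serialize: a schema header followed by framed record batches
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
payload = sink.getvalue()  # what Spark would ship to the Python side

# Deserialize on the Python side and convert to pandas
reader = pa.ipc.open_stream(payload)
pdf = reader.read_all().to_pandas()
print(pdf)
{code}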




