arrow-dev mailing list archives

From "Frederick Reiss (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets
Date Fri, 23 Sep 2016 23:35:20 GMT

    [ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517881#comment-15517881 ]

Frederick Reiss commented on ARROW-288:
---------------------------------------

Apologies for my delay in replying here; it's been a very hectic week.

Along the lines of what [~jacek@japila.pl] says above, I think it would be good to break this
overall task into smaller, bite-size chunks.

One top-level question that we'll need to answer before we can break things down properly:
Should we use Arrow's Java APIs or Arrow's C++ APIs to perform the conversion?

If we use the Java APIs to convert the data, then the "collect Dataset to Arrow" operation will go roughly
like this:
# Determine that the Spark Dataset can indeed be expressed in Arrow format.
# Obtain low-level access to the internal columnar representation of the Dataset.
# Convert Spark's columnar representation to Arrow using the Arrow Java APIs.
# Ship the Arrow buffer over the Py4j socket to the Python process as an array of bytes.
# Cast the array of bytes to a Python Arrow array.
All these steps will be contingent on Spark accepting a dependency on Arrow's Java API. Getting that
dependency accepted might be a bit tricky, given that the API doesn't have any users right now. At
the very least, we would need to break out some testing/documentation activities to create greater
confidence in the robustness of the Java APIs.
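
Just to make the Python end of step 5 concrete, here is a rough sketch, assuming the bytes arrive as an Arrow IPC stream and using the current pyarrow API (which postdates this discussion); the {{arrow_batches_from_bytes}} helper and the shape of the JVM-side plumbing are hypothetical:

{code:python}
import pyarrow as pa

def arrow_batches_from_bytes(buf: bytes):
    """Deserialize an Arrow IPC stream shipped over the Py4j socket
    into a list of RecordBatches (hypothetical receiving end of step 5)."""
    reader = pa.ipc.open_stream(pa.BufferReader(buf))
    return list(reader)

# Hypothetical usage: `raw` is the byte array collected from the JVM.
# batches = arrow_batches_from_bytes(raw)
# table = pa.Table.from_batches(batches)
{code}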

If we use Arrow's C++ API to do the conversion, the flow would go as follows:
# Determine that the Spark Dataset can be expressed in Arrow format.
# Obtain low-level access to the internal columnar representation of the Dataset.
# Ship chunks of column values over the Py4j socket to the Python process as arrays of primitive types.
# Insert the column values into an Arrow buffer on the Python side, using C++ APIs.
Note that the last step here could potentially be implemented against Pandas dataframes instead
of Arrow as a short-term expedient.
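
As a rough sketch of that last step, suppose the column chunks arrive as plain Python lists keyed by column name (an assumed layout) and we use the pyarrow bindings over the C++ API; the Pandas variant shows the short-term expedient mentioned above:

{code:python}
import pyarrow as pa
import pandas as pd

def columns_to_arrow(columns: dict):
    """Build an Arrow table from primitive column chunks shipped over Py4j.
    `columns` maps column name -> list of Python values (hypothetical layout)."""
    arrays = {name: pa.array(values) for name, values in columns.items()}
    return pa.table(arrays)

def columns_to_pandas(columns: dict):
    """Short-term alternative: materialize the same chunks as a Pandas DataFrame."""
    return pd.DataFrame(columns)
{code}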

A third possibility is to use Parquet as an intermediate format:
# Determine that the Spark Dataset can be expressed in Arrow format.
# Write the Dataset to a Parquet file in a location that the Python process can access.
# Read the Parquet file back into an Arrow buffer in the Python process using C++ APIs.
This approach would involve a lot less code, but it would of course require creating and deleting
temporary files.
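
The Python side of the Parquet route could be as small as the following sketch, assuming the JVM side has already written the Dataset to a temporary directory that both processes can see (the path handling and cleanup policy here are hypothetical):

{code:python}
import shutil
import pyarrow.parquet as pq

def read_collected_dataset(tmp_dir: str):
    """Read the Parquet part files Spark wrote into an Arrow table,
    then remove the temporary directory (hypothetical cleanup policy)."""
    try:
        return pq.read_table(tmp_dir)  # reads all part files in the directory
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
{code}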



> Implement Arrow adapter for Spark Datasets
> ------------------------------------------
>
>                 Key: ARROW-288
>                 URL: https://issues.apache.org/jira/browse/ARROW-288
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Java - Vectors
>            Reporter: Wes McKinney
>
> It would be valuable for applications that use Arrow to be able to 
> * Convert between Spark DataFrames/Datasets and Java Arrow vectors
> * Send / Receive Arrow record batches / Arrow file format RPCs to / from Spark 
> * Allow PySpark to use Arrow for messaging in UDF evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
