arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bryan Cutler <>
Subject Error Sending ArrowRecordBatch from Java to Python
Date Tue, 08 Nov 2016 23:58:54 GMT
Hi Devs,

I'm currently working on SPARK-13534 to use Arrow in Spark DataFrame
toPandas conversion and getting stuck with an invalid metadata size error
trying to send a simple ArrowRecordBatch created in Java over a socket to
Python.  The strategy so far is like this:

Java side:
- make a simple ArrowRecordBatch (1 field of Ints)
- create an ArrowWriter using a ByteArrayOutputStream
- call writer.writerRecordBatch() and writer.close() to write to a ByteArray
- send the ByteArray (framed with size) over a socket

Python side:
- read the ByteArray over the socket
- create an ArrowFileReader with the read bytes
- call reader.get_record_batch(0) to convert the bytes to a RecordBattch

This results in "pyarrow.error.ArrowException: Invalid: metadata size
invalid" and debugging shows the ArrowFileReader getting a metadata size of
269422093.  This is obviously way off, but it does seem to read some things
correctly like number of batches and offset.  Here is some debug output log output
16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 6
16/11/07 12:04:18 DEBUG ArrowWriter: RecordBatch at 8, metadata: 104, body:
16/11/07 12:04:18 DEBUG ArrowWriter: Footer starts at 136, length: 224
16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 370

Arrow-cpp printouts
read length 370
num batches 1
metadata size 269422093
offset 136

>From what I can tell by looking through the code, it seems like Java uses
Flatbuffers to write the metadata, but I don't see the Cython side using it
to read back the metadata.

Should this work with the classes I'm uses on both sides? or am I way off
with the above strategy?  I made a simplified example that mimics the Spark
integration and will reproduce the error, here is the gist

Sorry for the barrage of info, but any help would be much appreciated!


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message