arrow-dev mailing list archives

From Bryan Cutler <cutl...@gmail.com>
Subject Re: Error Sending ArrowRecordBatch from Java to Python
Date Wed, 09 Nov 2016 01:26:08 GMT
Thanks Wes!  I'd be happy to help out with the effort - I'll look into the
refs you mentioned.  I updated SPARK-13534 with what we have so far, but
not much has been done with the internal format conversion yet.

Bryan

On Tue, Nov 8, 2016 at 4:34 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

> Until we have integration tests proving otherwise, you can assume that
> the IPC wire/file formats are currently incompatible (not on purpose).
> Both Julien and I are working on corresponding efforts in Java and C++
> this week (e.g. see https://github.com/apache/arrow/pull/201 and
> ARROW-373), but it's going to be a little while until they're completed. We
> would certainly welcome help (including code review). This is a
> necessary step to moving forward with any hybrid Java/C++ Arrow
> applications.
>
> Also, can you please update SPARK-13534 when you have a WIP branch, or
> otherwise note that you've started work on the project (the issue isn't
> assigned to anyone)? I have also been looking into this with my
> colleagues at Two Sigma and I want to make sure we don't duplicate
> efforts. The general approach you've described makes sense to me.
>
> Thanks
> Wes
>
> On Tue, Nov 8, 2016 at 6:58 PM, Bryan Cutler <cutlerb@gmail.com> wrote:
> > Hi Devs,
> >
> > I'm currently working on SPARK-13534 to use Arrow in Spark DataFrame
> > toPandas conversion and getting stuck with an invalid metadata size error
> > trying to send a simple ArrowRecordBatch created in Java over a socket to
> > Python.  The strategy so far is like this:
> >
> > Java side:
> > - make a simple ArrowRecordBatch (1 field of Ints)
> > - create an ArrowWriter using a ByteArrayOutputStream
> > - call writer.writeRecordBatch() and writer.close() to write to a ByteArray
> > - send the ByteArray (framed with size) over a socket
> >
> > Python side:
> > - read the ByteArray over the socket
> > - create an ArrowFileReader with the read bytes
> > - call reader.get_record_batch(0) to convert the bytes to a RecordBatch
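> >
> > In code, the receiving side is roughly this -- a condensed sketch, where
> > I'm assuming the Java side frames with DataOutputStream.writeInt (a
> > 4-byte big-endian length); the port and the ArrowFileReader import are
> > just illustrative:
> >
> >     import socket
> >     import struct
> >
> >     def recv_exact(sock, n):
> >         # read exactly n bytes off the socket
> >         buf = b''
> >         while len(buf) < n:
> >             chunk = sock.recv(n - len(buf))
> >             if not chunk:
> >                 raise EOFError('socket closed mid-frame')
> >             buf += chunk
> >         return buf
> >
> >     sock = socket.create_connection(('localhost', 9999))  # illustrative port
> >     (size,) = struct.unpack('>i', recv_exact(sock, 4))    # the size frame
> >     payload = recv_exact(sock, size)                      # the Arrow bytes
> >
> >     # then hand the raw bytes to the reader as described above
> >     reader = ArrowFileReader(payload)   # exact import/constructor may differ
> >     batch = reader.get_record_batch(0)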
> >
> > This results in "pyarrow.error.ArrowException: Invalid: metadata size
> > invalid" and debugging shows the ArrowFileReader getting a metadata size
> > of 269422093.  This is obviously way off, but it does seem to read some
> > things correctly like number of batches and offset.  Here is some debug
> > output:
> >
> > ArrowWriter.java log output:
> > 16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 6
> > 16/11/07 12:04:18 DEBUG ArrowWriter: RecordBatch at 8, metadata: 104, body: 24
> > 16/11/07 12:04:18 DEBUG ArrowWriter: Footer starts at 136, length: 224
> > 16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 370
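> >
> > (Those offsets line up: the 6-byte magic is apparently padded to 8, the
> > batch is 104 + 24 = 128 bytes ending at 8 + 128 = 136, the footer runs
> > to 136 + 224 = 360, and the remaining 370 - 360 = 10 bytes would be a
> > 4-byte footer length plus the 6-byte trailing magic.)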
> >
> > Arrow-cpp printouts:
> > read length 370
> > num batches 1
> > metadata size 269422093
> > offset 136
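> >
> > One clue: 269422093 is 0x100F0E0D in hex, i.e. the four bytes 10 0F 0E 0D
> > read as a big-endian int32, which makes me suspect the reader is picking
> > up the length from the wrong offset (or with the wrong byte order) rather
> > than seeing a corrupted stream:
> >
> >     >>> hex(269422093)
> >     '0x100f0e0d'
> >     >>> import struct
> >     >>> struct.unpack('>i', b'\x10\x0f\x0e\x0d')
> >     (269422093,)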
> >
> > From what I can tell by looking through the code, it seems like Java uses
> > Flatbuffers to write the metadata, but I don't see the Cython side using
> > it to read back the metadata.
> >
> > Should this work with the classes I'm using on both sides, or am I way
> > off with the above strategy?  I made a simplified example that mimics the
> > Spark integration and will reproduce the error; here is the gist:
> > https://gist.github.com/BryanCutler/930ecd1de1c6d9484931505dcb6bb321
> >
> > Sorry for the barrage of info, but any help would be much appreciated!
> >
> > Thanks,
> > Bryan
>
