arrow-user mailing list archives

From Chris Zheng...@caudate.me>
Subject Re: Using 'zero copy' for interop with python from java
Date Sat, 13 Jun 2020 06:34:46 GMT
Hi Micah,

Thanks for the fantastic summary of what to do.

I’ll have a play with it in the next few weeks. 

Will keep you posted.

Chris

> On 12 Jun 2020, at 2:05 pm, Micah Kornfield <emkornfield@gmail.com> wrote:
> 
> Hi Chris,
> There isn't anything prepackaged for this use case as far as I know. As Uwe mentioned, it would probably be nice to build something using the C interface for this purpose, but I think you should be able to do it today as described below.
> 
> I think you can pass ArrowBuf pointers to Python via foreign_buffer [1], but as far as I know, you would probably have to do some amount of manual reconstruction of arrays from buffers. The rough steps would be:
> 1.  Serialize the schema on the Java side [2] and obtain a memory address from it to share with Python (via foreign_buffer).
> 2.  Deserialize the schema on the Python side using pyarrow.ipc.read_schema [3].
> 3.  Extract the buffer addresses/lengths in Java (example from Gandiva [4]) and reconstruct them with foreign_buffer.
> 4.  Traverse the DataTypes in the pyarrow schema to reconstruct the arrays [5] based on the number of buffers required [6].
> 
> If you do end up doing this, then I think #4 might make a nice contribution to the project.
> 
> Thanks,
> Micah
> 
> [1] https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html
> [2] https://arrow.apache.org/docs/java/org/apache/arrow/vector/ipc/message/MessageSerializer.html#serializeMetadata-org.apache.arrow.vector.types.pojo.Schema
> [3] https://github.com/apache/arrow/blob/1164079d5442c3910c18549bfcd2e68d4554b909/python/pyarrow/ipc.pxi#L577
> [4] https://github.com/apache/arrow/blob/17bdb5af9b3c63f6cbef57e88a6d2513e781b532/java/gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/Projector.java#L139
> [5] https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.from_buffers
> [6] https://arrow.apache.org/docs/python/generated/pyarrow.DataType.html#pyarrow.DataType.num_buffers
> 
> 
> On Mon, Jun 8, 2020 at 12:55 AM Chris Zheng <z@caudate.me> wrote:
> That blog post is really good. However, I’d like to do this in a running JVM as opposed to a Python program.
> 
> 
>> On 8 Jun 2020, at 11:24 am, Micah Kornfield <emkornfield@gmail.com> wrote:
>> 
>> Uwe wrote a blog post [1] on how to do this with PY4J a while ago. I think this ends up being zero copy, but I'm not 100% sure.
>> 
>> [1] https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html

