arrow-user mailing list archives

From Ryan Murray <rym...@dremio.com>
Subject Re: Python and Java interoperability
Date Tue, 21 Jul 2020 12:50:46 GMT
Hey Jiaxing,

You want to use the IPC streaming mechanism to pass Arrow buffers between languages[1].

First get a buffer:
```
import pyarrow as pa

# Build a record batch from three columns.
data = [
    pa.array([1, 2, 3, 4]),
    pa.array(['foo', 'bar', 'baz', None]),
    pa.array([True, None, False, True])
]
batch = pa.record_batch(data, names=['f0', 'f1', 'f2'])

# Serialize the batch to an in-memory buffer in the IPC stream format.
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()
buf = sink.getvalue()
```

The buffer can then be written to Redis, to a file, etc. For Redis I think
`r.set("key", buf.hex())` is easiest; that way you don't have to worry about
binary encoding.
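
As a minimal sketch, assuming redis-py and a Redis instance on localhost (the
key name "key" is just a placeholder):
```
import redis

# Connect to Redis; adjust host/port/db for your deployment.
r = redis.Redis(host='localhost', port=6379, db=0)

# Hex-encode the IPC stream produced above so the stored value is plain text;
# the Java side decodes the hex string back to bytes before reading it.
r.set("key", buf.hex())
```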

On the Java side, something like:
```
import java.io.ByteArrayInputStream;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import redis.clients.jedis.Jedis;

// Fetch the hex-encoded IPC stream from Redis, decode it back to bytes
// (hexStringToByteArray is a small helper you supply), and read the stream.
Jedis jedis = new Jedis();
String buf = jedis.get("key");
RootAllocator rootAllocator = new RootAllocator(Long.MAX_VALUE);
ByteArrayInputStream in = new ByteArrayInputStream(hexStringToByteArray(buf));
ArrowStreamReader stream = new ArrowStreamReader(in, rootAllocator);
VectorSchemaRoot vsr = stream.getVectorSchemaRoot();
stream.loadNextBatch();
```
And the VectorSchemaRoot now holds the deserialized record batch.
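
From there you can pull the individual columns out of the VectorSchemaRoot. A
minimal sketch continuing from the snippet above, assuming the three columns
written by the Python side (the vector classes live in org.apache.arrow.vector):
```
// int64, utf8 and bool arrays arrive as BigIntVector, VarCharVector and BitVector.
BigIntVector f0 = (BigIntVector) vsr.getVector("f0");
VarCharVector f1 = (VarCharVector) vsr.getVector("f1");
BitVector f2 = (BitVector) vsr.getVector("f2");

for (int i = 0; i < vsr.getRowCount(); i++) {
    // getObject handles nulls and returns Long / Text / Boolean respectively.
    System.out.println(f0.getObject(i) + "\t" + f1.getObject(i) + "\t" + f2.getObject(i));
}
```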

While Redis will work for this, you might find a file or socket a bit more
ergonomic with Arrow. The Plasma object store is also an option[2], which you
can think of as a primitive Redis built specifically for Arrow buffers. Finally,
if you are using Redis as a message bus, you might find that Arrow Flight,
Arrow's RPC mechanism, is a good choice[3].

[1]
https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-streams
[2] https://arrow.apache.org/docs/python/plasma.html
[3] https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/

On Tue, Jul 21, 2020 at 10:57 AM Jesse Wang <hello.wjx@gmail.com> wrote:

> Hi,
> I want to have a Java process read the content of DataFrames produced by a
> Python process. The Java and Python processes run on different hosts.
>
> The solution I can think of is to have the Python process serialize the
> DataFrame and save it to redis, and have the Java process parse the data.
>
> The solution I found serializes the DataFrame to 'pybytes':
> (from
> https://stackoverflow.com/questions/57949871/how-to-set-get-pandas-dataframes-into-redis-using-pyarrow
> )
> ```
> import pandas as pd
> import pyarrow as pa
> import redis
>
> df = pd.DataFrame({'A': [1, 2, 3]})
> r = redis.Redis(host='localhost', port=6379, db=0)
>
> context = pa.default_serialization_context()
> r.set("key", context.serialize(df).to_buffer().to_pybytes())
> context.deserialize(r.get("key"))
> #    A
> # 0  1
> # 1  2
> # 2  3
> ```
>
> I wonder if this serialized 'pybytes' can be parsed at the Java end? If
> not, how can I achieve this properly?
>
> Thanks!
>
> --
>
> Best Regards,
> Jiaxing Wang
>
>
