From user-return-542-archive-asf-public=cust-asf.ponee.io@arrow.apache.org Wed Jul 22 05:49:43 2020 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mailroute1-lw-us.apache.org (mailroute1-lw-us.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with ESMTPS id 90EA5180643 for ; Wed, 22 Jul 2020 07:49:43 +0200 (CEST) Received: from mail.apache.org (localhost [127.0.0.1]) by mailroute1-lw-us.apache.org (ASF Mail Server at mailroute1-lw-us.apache.org) with SMTP id 7666C125BB6 for ; Wed, 22 Jul 2020 05:49:07 +0000 (UTC) Received: (qmail 17101 invoked by uid 500); 22 Jul 2020 05:49:04 -0000 Mailing-List: contact user-help@arrow.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@arrow.apache.org Delivered-To: mailing list user@arrow.apache.org Received: (qmail 17086 invoked by uid 99); 22 Jul 2020 05:49:04 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 22 Jul 2020 05:49:04 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id D699D1A32A2 for ; Wed, 22 Jul 2020 05:49:03 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.01 X-Spam-Level: X-Spam-Status: No, score=0.01 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_KAM_HTML_FONT_INVALID=0.01, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-ec2-va.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id 3idR-RKCAmsE for ; Wed, 22 Jul 2020 05:49:02 +0000 (UTC) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=209.85.166.51; helo=mail-io1-f51.google.com; envelope-from=hello.wjx@gmail.com; receiver= Received: from mail-io1-f51.google.com (mail-io1-f51.google.com [209.85.166.51]) by mx1-ec2-va.apache.org (ASF Mail Server at mx1-ec2-va.apache.org) with ESMTPS id 0DC99BB94A for ; Wed, 22 Jul 2020 05:49:02 +0000 (UTC) Received: by mail-io1-f51.google.com with SMTP id l1so1285656ioh.5 for ; Tue, 21 Jul 2020 22:49:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=ukUGVG1o6TTk9FkNmNH1UG3swP4Z6H38U8mVKtu4VG8=; b=cKB+q19QNSjDm1jEymXxFkt/WUWaT8NJ170FOJOO99foBGkaC2BfBvak1lWLsDjoHJ 8hYTrmr2IKhi/6XDCQTBkGzrA1QLVLe+AMm5FONp3Md7VHyYe5fBIUT9z/Qw/9nfH2Mx jkNUtjPP0bRtJuwARj3aKEW7H4ymQEen8T5OhagF8JPycErIEsAhLc/tBmosYqWXjWHT 1U+WxpLqonu/qjlcLQmAt7vd9zD8XL3xBYHd3zuL2s6QwrE/KRCJBlVr7mf7u7yCI7Jr uBQm+fYZzoyvmgaAbNvCtugER5PvLo1PZADttR8VmB5mXLkIUcDCCiQjkHUsn6DV309s 4ZlQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=ukUGVG1o6TTk9FkNmNH1UG3swP4Z6H38U8mVKtu4VG8=; b=YwvDL6/DFajALAsGETMnFTTyNVfKOR1VdBoSYxBB+d3hfQxF8NND0xxGWQGer53rdZ M4oAuCY1D9R6X5X57xPzVB7QeFHd7m6YAtlVexfOEG9rbJ0lyzA4Z0IJqIuIWHFZ1dqc a/42U1DzjiqvNzPbl3UHzps6PHxSJAEDyALs+1jNBHgEaD+D8lkn+imexch86YA+AXWW Ae4drmrXvDEwuQx25tET9/+gbR5aJlTpg/Hm+wXXVpVwY3z+HtG1LIc9WATX4e8HEqEj zQEEM2uO9A9WleKsi2PCccwqGPqj3kaYpTY512r0KWQK8UZZNWXRsPBIoRTCR76hYRb8 rlog== X-Gm-Message-State: AOAM533f9DEGD0lXXyTZMYhIIon6fs2v56xdY7hkQaFx7/iMsjlaY9XM z5cRnQaV1s56PAExWjPhZ5pw8Y0eHZE4LnxYI2UQy+yIQR0= X-Google-Smtp-Source: ABdhPJz3b6e4rhzPVnBQyzfdF7i3SZvKpCCjg+is/ig8UAPQBLiGL1bNAFylCCW0teXw6yXtrJEf7HP2Q6NxpeyHy2M= X-Received: by 2002:a02:108a:: with SMTP id 132mr36903690jay.131.1595396941205; Tue, 21 Jul 2020 22:49:01 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Jesse Wang Date: Wed, 22 Jul 2020 13:48:50 +0800 Message-ID: Subject: Re: Python and Java interoperability To: user@arrow.apache.org Content-Type: multipart/alternative; boundary="000000000000f4d54b05ab014b32" --000000000000f4d54b05ab014b32 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi Ryan, Thanks for your reply. On Tue, Jul 21, 2020 at 8:54 PM Ryan Murray wrote: > Hey Jiaxing, > > You want to use the IPC mechanism to pass arrow buffers between > languages[1] > > First get a buffer: > ``` > import pyarrow as pa > > data =3D [ > pa.array([1, 2, 3, 4]), > pa.array(['foo', 'bar', 'baz', None]), > pa.array([True, None, False, True]) > ] > batch =3D pa.record_batch(data, names=3D['f0', 'f1', 'f2']) > sink =3D pa.BufferOutputStream() > writer =3D pa.ipc.new_stream(sink, batch.schema) > writer.write_batch(batch) > writer.close() > buf =3D sink.getvalue() > ``` > > The buffer could be written to Redis, to a file etc. For redis I think > `r.set("key", buf.hex())` is easiest, you don't have to worry about > encoding. > > On the java side something like: > ``` > Jedis jedis =3D new Jedis(); > String buf =3D jedis.get("key"); > RootAllocator rootAllocator =3D new RootAllocator(Long.MAX_VALUE); > ByteArrayInputStream in =3D new > ByteArrayInputStream(hexStringToByteArray(buf)); > ArrowStreamReader stream =3D new ArrowStreamReader(in, rootAllocator)= ; > VectorSchemaRoot vsr =3D stream.getVectorSchemaRoot(); > stream.loadNextBatch() > ``` > And the VectorSchemaRoot holds the correct Arrow Buffer. > I tested this and get the following exception thrown: ``` Exception in thread =E2=80=9Cmain=E2=80=9D java.lang.IllegalArgumentExcepti= on at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSe= rializer.java:547) at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageCh= annelReader.java:58) at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.= java:132) at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:178) at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:= 169) at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.jav= a:62) at Main.main(Main.java:38) ``` > While Redis will work for this you might find a file or socket a bit more > ergonomic in Arrow. The Plasma object store is also an option[2] which yo= u > can think of as a primitive Redis specifically for Arrow Buffers. Finally= , > if you are using Redis as a message bus you might find the Arrow RPC > mechanism Arrow Flight is a good choice[3]. > As for the Plasma, It seems it currently limited to a single host... (From its source code arrow/cpp/src/plasma/io.cc, it used AF_UNIX socket only) > > [1] > https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-streams > [2] https://arrow.apache.org/docs/python/plasma.html > [3] https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/ > > On Tue, Jul 21, 2020 at 10:57 AM Jesse Wang wrote: > >> Hi, >> I want to have a Java process read the content of DataFrames produced by >> a Python process. The Java and Python processes run on different hosts. >> >> The solution I can think of is to have the Python process serialize the >> DataFrame and save it to redis, and have the Java process parse the data= . >> >> The solution I find serializes the DataFrame to 'pybytes': >> (from >> https://stackoverflow.com/questions/57949871/how-to-set-get-pandas-dataf= rames-into-redis-using-pyarrow >> ) >> ``` >> import pandas as pd >> >> import pyarrow as paimport redis >> >> df=3Dpd.DataFrame({'A':[1,2,3]}) >> r =3D redis.Redis(host=3D'localhost', port=3D6379, db=3D0) >> >> context =3D pa.default_serialization_context() >> r.set("key", context.serialize(df).to_buffer().to_pybytes()) >> context.deserialize(r.get("key")) >> A0 11 22 3 >> >> ``` >> >> I wonder if this serialized 'pybytes' can be parsed at the Java end? If >> not, how can I achieve this properly? >> >> Thanks! >> >> -- >> >> Best Regards, >> Jiaxing Wang >> >> > --=20 Best Regards, Jiaxing Wang --000000000000f4d54b05ab014b32 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
=C2=A0Hi=C2=A0Ryan,
Thanks for your reply.<= /div>

On Tue, Jul 21, 2020 at 8:54 PM Ryan Murray <rymurr@dremio.com> wrote:
Hey Jiaxing,

You wan= t to use the IPC mechanism to pass arrow buffers between languages[1]
=

First get a buffer:
```
import pyarrow as = pa

data =3D [
=C2=A0 =C2=A0 pa.array([1, 2, 3, 4]),
=C2=A0 =C2= =A0 pa.array(['foo', 'bar', 'baz', None]),
=C2= =A0 =C2=A0 pa.array([True, None, False, True])
]
batch =3D= pa.record_batch(data, names=3D['f0', 'f1', 'f2'])<= br>
sink =3D pa.BufferOutputStream()
writer =3D pa.ipc.new_str= eam(sink, batch.schema)
writer.write_batch(batch)
writer.c= lose()
buf =3D sink.getvalue()
```
The buffer could be written to Redis, to a file etc. For redis= I think `r.set("key", buf.hex())` is easiest, you don't have= to worry about encoding.

On the java side somethi= ng like:
```
=C2=A0 =C2=A0 Jedis jedis =3D new Jedis();=
=C2=A0 =C2=A0 String buf =3D jedis.get("key");
=C2=A0 =C2= =A0 RootAllocator rootAllocator =3D new RootAllocator(Long.MAX_VALUE);
= =C2=A0 =C2=A0 ByteArrayInputStream in =3D new ByteArrayInputStream(hexStrin= gToByteArray(buf));
=C2=A0 =C2=A0 ArrowStreamReader stream =3D new Arrow= StreamReader(in, rootAllocator);
=C2=A0 =C2=A0 VectorSchemaRoot vsr =3D = stream.getVectorSchemaRoot();
=C2=A0 =C2=A0 stream.loadNextBatch()
=
```
And the VectorSchemaRoot holds the correct Arrow Buffer.=

I teste= d=C2=A0this and get the following exception thrown:
```
Exception in thread =E2=80=9Cmain=E2=80=9D java.lang.IllegalArgumentExcept= ion
=C2=A0 at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
=C2= =A0 at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(Me= ssageSerializer.java:547)
=C2=A0 at org.apache.arrow.vector.ipc.message.= MessageChannelReader.readNext(MessageChannelReader.java:58)
=C2=A0 at or= g.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.ja= va:132)
=C2=A0 at org.apache.arrow.vector.ipc.ArrowReader.initialize(Arr= owReader.java:178)
=C2=A0 at org.apache.arrow.vector.ipc.ArrowReader.ens= ureInitialized(ArrowReader.java:169)
=C2=A0 at org.apache.arrow.vector.i= pc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:62)
=C2=A0 at Main.m= ain(Main.java:38)
```
=C2=A0
While Redis will work for this yo= u might find a file or socket a bit more ergonomic in Arrow. The Plasma obj= ect store is also an option[2] which you can think of as a primitive Redis = specifically for Arrow Buffers. Finally, if you are using Redis as a messag= e bus you might find the Arrow RPC mechanism Arrow Flight is a good choice[= 3].
As for the Plasma, It se= ems it currently limited to a single host... (From its source code arrow/cp= p/src/plasma/io.cc, it used AF_UNIX socket only)
=C2=A0

On Tue, Jul 21, 2020 at 10:57 AM Jesse Wang <hello.wjx@gmail.com> = wrote:
Hi,
I want to have a Java process read the content of Da= taFrames produced by a Python process. The Java and Python processes run=C2= =A0on different hosts.

The solution I can think = of is to have the Python process serialize the DataFrame and save it to red= is, and have the Java process parse the data.

The = solution I find serializes the DataFrame to 'pybytes':
```
=C2=A0 =C2=A0import=C2= =A0pandas=C2=A0as=C2=A0pd
import pyarrow as pa
impo=
rt r=
edis

df=
=3Dp=
d.DataFrame({'A':[1,2,3]})
r =
=3D =
redis.Redis(host=3D'localhost', port=3D6379, db=3D0)

context =3D pa.d=
efault_serialization_context()
r.set("key&=
quot;, c=
ontext.s=
erialize(df)=
.to_=
buffer().to_pybytes())
context.=
deserialize(r.ge=
t("=
key"))
   A
0  1<=
span style=3D"margin:0px;padding:0px;border:0px;font-style:inherit;font-var=
iant:inherit;font-weight:inherit;font-stretch:inherit;line-height:inherit;f=
ont-family:inherit;vertical-align:baseline;box-sizing:inherit">
1  2<=
span style=3D"margin:0px;padding:0px;border:0px;font-style:inherit;font-var=
iant:inherit;font-weight:inherit;font-stretch:inherit;line-height:inherit;f=
ont-family:inherit;vertical-align:baseline;box-sizing:inherit">
2  3<=
/code>
```

I wonder if this serial= ized 'pybytes' can be parsed at the Java end? If not, how can I ach= ieve this properly?

Thanks!

--

Best Regards,
Jiaxing Wang
=C2=A0


--

Best Regards,
Jiaxing Wang
=C2=A0
--000000000000f4d54b05ab014b32--