From: Chris Zheng <z@caudate.me>
To: user@arrow.apache.org, emkornfield@gmail.com
Subject: Re: Using 'zero copy' for interop with python from java
Date: Sat, 13 Jun 2020 14:34:46 +0800

Hi Micah,

Thanks for the fantastic summary of what to do.

I’ll have a play with it in the next few weeks.

Will keep you posted.

Chris

> On 12 Jun 2020, at 2:05 pm, Micah Kornfield <emkornfield@gmail.com> wrote:
>
> Hi Chris,
> There isn't anything prepackaged for this use case as far as I know. As Uwe mentioned, it would probably be nice to build something using the C interface for this purpose, but I think you should be able to do it today as described below.
>
> I think you can pass ArrowBuf pointers to Python via foreign_buffer [1], but as far as I know, you would probably have to do some amount of manual reconstruction of arrays from buffers. The rough steps would be:
> 1. Serialize the schema on the Java side [2] and obtain a memory address from it to share with Python (via foreign_buffer).
> 2. Deserialize the schema on the Python side using pyarrow.ipc.read_schema [3].
> 3. Extract the buffer addresses/lengths in Java (example from Gandiva [4]) and reconstruct with foreign_buffer.
> 4. Traverse the DataTypes of the pyarrow schema to reconstruct the arrays [5] based on the number of buffers required [6].
>
> If you do end up doing this, then I think #4 might make a nice contribution to the project.
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html
> [2] https://arrow.apache.org/docs/java/org/apache/arrow/vector/ipc/message/MessageSerializer.html#serializeMetadata-org.apache.arrow.vector.types.pojo.Schema
> [3] https://github.com/apache/arrow/blob/1164079d5442c3910c18549bfcd2e68d4554b909/python/pyarrow/ipc.pxi#L577
> [4] https://github.com/apache/arrow/blob/17bdb5af9b3c63f6cbef57e88a6d2513e781b532/java/gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/Projector.java#L139
> [5] https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.from_buffers
> [6] https://arrow.apache.org/docs/python/generated/pyarrow.DataType.html#pyarrow.DataType.num_buffers
>
>
> On Mon, Jun 8, 2020 at 12:55 AM Chris Zheng <z@caudate.me> wrote:
> That blog post is really good. However, I’d like to do this in a running JVM as opposed to a Python program.
>
>
>> On 8 Jun 2020, at 11:24 am, Micah Kornfield <emkornfield@gmail.com> wrote:
>>
>> Uwe wrote a blog post [1] on how to do this with PY4J a while ago. I think this ends up being zero copy, but I'm not 100% sure.
>>
>> [1] https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
