To: user@arrow.apache.org
From: Antoine Pitrou
Subject: Re: [Python] Efficient numpy.recarray to pyarrow.StructArray conversion
Date: Mon, 22 Mar 2021 11:29:54 +0100

On Mon, 22 Mar 2021 06:36:57 +0000
Hagai Har-Gil wrote:
> Hmm, it seems that my mental model was off - I'm indeed interested in an
> array of structs and not in a struct of arrays. After re-reading the
> (Python) docs I'd argue that they're not clear that a StructArray is
> indeed a SoA, and the behavior of the object with respect to indexing
> further strengthens this notion I had. I might try to put together a docs
> PR to address this, if you think it's worth mentioning.

I don't think it makes sense to mention it specifically in the Python
docs, since it's a characteristic of the Arrow format and applies to
all implementations:
https://arrow.apache.org/docs/format/Columnar.html#struct-layout

Regards

Antoine.

>
> Thanks,
> Hagai.
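
For illustration, the difference between the two layouts discussed in this
thread can be sketched with the Python standard library alone. This is only
a model of the memory layouts, not pyarrow code: the two-row data mirrors the
thread's example, while the names ROW, aos, field0, field1 and row_at are
made up for this sketch.

```python
import struct
from array import array

# Two rows with fields (field0: int32, field1: int8), as in the thread's example.
rows = [(0, 10), (1, 20)]

# "Array of structs" (NumPy structured array style): the fields of each row
# are packed next to each other, so the bytes of the two fields interleave.
ROW = struct.Struct("<ib")  # little-endian int32 followed by int8, no padding
aos = b"".join(ROW.pack(*row) for row in rows)

# "Struct of arrays" (Arrow struct layout style): one contiguous child
# buffer per field. Building these from `aos` has to visit and copy every value.
field0 = array("i", (f0 for f0, _ in ROW.iter_unpack(aos)))
field1 = array("b", (f1 for _, f1 in ROW.iter_unpack(aos)))


def row_at(i):
    """A 'row' of the struct-of-arrays form is the same index into each child."""
    return (field0[i], field1[i])
```

Indexing a row therefore gathers one value from each child buffer, which may
be why row access on a StructArray looks like an array of structs even though
the storage is columnar.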
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Sunday, March 21, 2021 3:51 PM, Antoine Pitrou wrote:
>
> > On Sun, 21 Mar 2021 12:33:09 +0000
> > Hagai Har-Gil hagaihargil@protonmail.com wrote:
> >
> > > After some more digging I did arrive at something which seems more
> > > efficient than what I had:
> > > struct_schema = pa.struct([('field0', pa.int32()), ('field1', pa.int8())])
> > > nparray = x = np.array([(0, 10), (1, 20)], dtype=[('field0', '<i4'), ('field1', '<i1')])
> > > struct_array = pa.array(nparray, type=struct_schema)
> > > This looks easy, although I'm not sure how much copying is done down
> > > below.
> >
> > The data is definitely copied under the hood, since this is
> > converting from an "array of structs" layout (the Numpy array) to a
> > "struct of arrays" layout (the Arrow array).
> >
> > This is a conceptual constraint. I don't think it is possible to
> > create a Numpy struct array that would use separate data areas for the
> > struct fields.
> >
> > Regards
> >
> > Antoine.
> >
> > > I now have an issue with the Rust implementation since I'm not sure
> > > how to access or iterate over the rows of the resulting StructArray,
> > > which was trivial in Python.
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Sunday, March 21, 2021 2:22 PM, Hagai Har-Gil hagaihargil@protonmail.com wrote:
> > >
> > > > After some more digging I did arrive at something which seems more
> > > > efficient than what I had:
> > > > struct_schema = pa.struct([('field0', pa.int32()), ('field1', pa.int8())])
> > > > nparray = x = np.array([(0, 10), (1, 20)], dtype=[('field0', '<i4'), ('field1', '<i1')])
> > > > struct_array = pa.array(nparray, type=struct_schema)
> > > > This looks easy, although I'm not sure how much copying is done down
> > > > below.
> > > > I now have an issue with the Rust implementation since I'm not sure
> > > > how to access or iterate over the rows of the resulting StructArray.
> > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > On Sunday, March 21, 2021 10:52 AM, Hagai Har-Gil hagaihargil@protonmail.com wrote:
> > > >
> > > > > Hi,
> > > > > I'm trying to efficiently convert incoming numpy.recarrays to
> > > > > pyarrow.StructArray and I'm unsure how to do so with the least
> > > > > amount of copying.
> > > > > My use case involves real-time data processing of numpy.recarrays
> > > > > in Rust. I'm happily using the IPC protocol to transfer data to
> > > > > Rust's arrow implementation, which will do the heavy lifting. I'll
> > > > > need to iterate on the recarray-turned-StructArray line-by-line,
> > > > > each time yielding all fields of a specific row, so the
> > > > > StructArray format is quite fitting. However, doing the actual
> > > > > conversion in an efficient manner seems harder than expected. The
> > > > > fields (=individual arrays) of a numpy.recarray aren't stored in
> > > > > a contiguous manner, so any numpy.recarray -> pyarrow.Array
> > > > > conversion first has to copy the data to standard pyarrow.Array
> > > > > buffers, and then re-construct the StructArray structure by
> > > > > interleaving the arrays. I was unable to find in the docs or in
> > > > > previous discussions here a better approach for this type of
> > > > > pre-processing step.
> > > > > Since I'm using IPC I'll eventually need to have the
> > > > > pyarrow.StructArray wrapped in a pyarrow.RecordBatch if that
> > > > > makes any difference.
> > > > > Thanks in advance