From user-return-1103-archive-asf-public=cust-asf.ponee.io@arrow.apache.org Sun Mar 21 17:03:00 2021 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mxout1-he-de.apache.org (mxout1-he-de.apache.org [95.216.194.37]) by mx-eu-01.ponee.io (Postfix) with ESMTPS id 2661818065E for ; Sun, 21 Mar 2021 18:03:00 +0100 (CET) Received: from mail.apache.org (mailroute1-lw-us.apache.org [207.244.88.153]) by mxout1-he-de.apache.org (ASF Mail Server at mxout1-he-de.apache.org) with SMTP id CDF6662EDA for ; Sun, 21 Mar 2021 17:02:58 +0000 (UTC) Received: (qmail 44989 invoked by uid 500); 21 Mar 2021 17:02:57 -0000 Mailing-List: contact user-help@arrow.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@arrow.apache.org Delivered-To: mailing list user@arrow.apache.org Received: (qmail 44979 invoked by uid 99); 21 Mar 2021 17:02:57 -0000 Received: from spamproc1-he-de.apache.org (HELO spamproc1-he-de.apache.org) (116.203.196.100) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 21 Mar 2021 17:02:57 +0000 Received: from localhost (localhost [127.0.0.1]) by spamproc1-he-de.apache.org (ASF Mail Server at spamproc1-he-de.apache.org) with ESMTP id 304591FF439 for ; Sun, 21 Mar 2021 17:02:57 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamproc1-he-de.apache.org X-Spam-Flag: NO X-Spam-Score: -0.002 X-Spam-Level: X-Spam-Status: No, score=-0.002 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.2, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamproc1-he-de.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-ec2-va.apache.org ([116.203.227.195]) by localhost (spamproc1-he-de.apache.org [116.203.196.100]) (amavisd-new, port 10024) with ESMTP id G_1NiiUxlcJS for ; Sun, 21 Mar 2021 17:02:55 +0000 (UTC) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=209.85.218.47; helo=mail-ej1-f47.google.com; envelope-from=emkornfield@gmail.com; receiver= Received: from mail-ej1-f47.google.com (mail-ej1-f47.google.com [209.85.218.47]) by mx1-ec2-va.apache.org (ASF Mail Server at mx1-ec2-va.apache.org) with ESMTPS id 279A5BCD4E for ; Sun, 21 Mar 2021 17:02:55 +0000 (UTC) Received: by mail-ej1-f47.google.com with SMTP id u9so17475206ejj.7 for ; Sun, 21 Mar 2021 10:02:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:reply-to:from:date:message-id :subject:to; bh=4J2ixHHx3nBh5xDA/CKahpojCJFsIepqApEIF/xEyZo=; b=IPbjXXOXuNSEk9GpnnAq4St1LejtrsNWQIER9e6QnZnz9R8f63LJs3UiUQIKl55hU4 Sv8JNNNFtLzCHrKoMwBzIuTxIqGpKT19PFeLdSJKzJMvqU8TmUkWhrmgIvV2VwJMqKUq 72lv19LJbnV3YP2iwFPqQL7fm77wzXXxxuKz301BtYEeOezQ9BDok2xeVYmKN4iIHzua lTZXyFVt7f1nvPQe5FLC7PwuHhGef3llo5TOdCRBZlELv5O2jvQc7XJDMif3AIulRgMa nNQFxRXufZyRr8Pqj34FPyjIW6kAwGbWguBghf2c5d38EwLh/TA0p9sRX86mYhr/0Bvn llhQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:reply-to :from:date:message-id:subject:to; bh=4J2ixHHx3nBh5xDA/CKahpojCJFsIepqApEIF/xEyZo=; b=RJIb5ONF4OLmNLg5LqHf/RXY82MLhPWcTeY7/YoQ808FPMnd6AAIi8IodD3D+fmn4H POk9pv34ya16vDsxmnpfi8lIFTjVjITyzQph41TEJ1UIt+k8yiofuP+5QI19SH9RqQhV uX3EglAMbEf8r32kXXSnLevQzWClsoMMWpsABOXXFUtMbwnxIrzidtDb41k+bnFKLcHp llhhsNvJsEV2/aDCmTlUoKsrK1ECwHAdKkN/dURZ5Ak2vx1hIyP1LfJOgHFBZEzgk/S1 ZjmYN9LqCYnmn2yWazTpGx5GoPY+gAVqBp0PCv9t8vbc9DgaqxI3oW+AdxB8OJsAuGgk ZjQQ== X-Gm-Message-State: AOAM5322PfhTNbbJzgtnOLCdVq/gMFkHR47UBY11H4S6UAM7YnOX9iY1 UjZDtGjV0F4i0cH6sen75T7a9KbrTNy8/EsYvbXoQ309h8uBaA== X-Google-Smtp-Source: ABdhPJwJo85//s0WEoHgXEK3mIm5CClwRmoK5XLxKhfH8YYqh2eicqiwZpRWRXScgezCW8rWCogiiWsl18585gBly+Y= X-Received: by 2002:a17:906:14d0:: with SMTP id y16mr15440468ejc.242.1616346173868; Sun, 21 Mar 2021 10:02:53 -0700 (PDT) MIME-Version: 1.0 References: <8PA3x2qpQ5-XmPs76bw5Tm-GoSDFcLsVe6IPhBpURzL6ppTW3BLszhQrHDZJABCwNbshI2f_VHRgvikyhv92tH_W03bddoLMH-_tSgLCfXQ=@protonmail.com> <0Tm-mXRgzU_ZGDBOck1Er14bQ5ktUxuX3nHt5n3T-6tsj_CFTeDN-7KkEYjtCx8YV4ZbsnRKkjqYQ_42aPfEivPejWGE_zSaSa4lQULJaLA=@protonmail.com> <20210321145118.71a103ed@fsol> In-Reply-To: <20210321145118.71a103ed@fsol> Reply-To: emkornfield@gmail.com From: Micah Kornfield Date: Sun, 21 Mar 2021 10:02:42 -0700 Message-ID: Subject: Re: [Python] Efficient numpy.recarray to pyarrow.StructArray conversion To: user@arrow.apache.org Content-Type: multipart/alternative; boundary="000000000000870f6405be0eeba6" --000000000000870f6405be0eeba6 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable > > This is a conceptual constraint. I don't think it is possible to > create a Numpy struct array that would use separate data areas for the > struct fields. If the requirement is to convert these to StructArray's then this is accurate There was a proposed "struct" type but it never got implemented, and this use-case seems somewhat niche. Using an Extension Array on top of FixedSizeByteArray seems like the best representation to possibly avoid copying (not sure if existing python translation libraries would actually make this zero copy). -Micah On Sun, Mar 21, 2021 at 6:51 AM Antoine Pitrou wrote: > On Sun, 21 Mar 2021 12:33:09 +0000 > Hagai Har-Gil wrote: > > After some more digging I did arrive at something which seems more > efficient than what I had: > > > > struct_schema =3D pa.struct([('field0', pa.int32()), ('field1', > pa.int8())]) > > nparray =3D x =3D np.array([(0, 10), (1, 20)], dtype=3D[('field0', ' ('field1', ' > struct_array =3D pa.array(nparray, type=3Dstruct_schema) > > > > This looks easy, although I'm not sure how much copying is done down > below. > > The data is definitely copied under the hood, since this is > converting from an "array of structs" layout (the Numpy array) to a > "struct of arrays" layout (the Arrow array). > > This is a conceptual constraint. I don't think it is possible to > create a Numpy struct array that would use separate data areas for the > struct fields. > > Regards > > Antoine. > > > > > > > I now have an issue with the Rust implementation since I'm not sure how > do I access or iterate over the rows of the resulting StructArray, which > was trivial in Python. > > =E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90 Origina= l Message =E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90 > > On Sunday, March 21, 2021 2:22 PM, Hagai Har-Gil < > hagaihargil@protonmail.com> wrote: > > > > > After some more digging I did arrive at something which seems more > efficient than what I had: > > > > > > struct_schema =3D pa.struct([('field0', pa.int32()), ('field1', > pa.int8())]) > > > nparray =3D x =3D np.array([(0, 10), (1, 20)], dtype=3D[('field0', '<= i4'), > ('field1', ' > > struct_array =3D pa.array(nparray, type=3Dstruct_schema) > > > > > > This looks easy, although I'm not sure how much copying is done down > below. > > > > > > I now have an issue with the Rust implementation since I'm not sure > how do I access or iterate over the rows of the resulting StructArray. > > > > > > =E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90 Origi= nal Message =E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90 > > > On Sunday, March 21, 2021 10:52 AM, Hagai Har-Gil < > hagaihargil@protonmail.com> wrote: > > > > > >> Hi, > > >> > > >> I'm trying to efficiently convert incoming numpy.recarray's to > pyarrow.StructArray and I'm unsure how to do so with the least amount of > copying. > > >> > > >> My use case involves real time data processing of numpy.recarrays in > Rust. I'm happily using the IPC protocol to transfer data to Rust's arrow > implementation which will do the heavy lifting. I'll need to iterate on t= he > recarray-turned-StructArray line-by-line, each time yielding all fields o= f > a specific row, so the StructArray format is quite fitting. However, doin= g > the actual conversion in an efficient manner seems harder than expected. > The fields (=3Dindividual arrays) of a numpy.recarray aren't stored in a > contiguous manner, so any numpy.recarray -> pyarrow.Array conversion firs= t > has to copy the data to standard pyarrow.Array buffers, and then > re-construct the StructArray structure by interleaving the arrays. I was > unable to find in the docs or in previous discussions here a better > approach for this type of pre-processing step. > > >> > > >> Since I'm using IPC I'll eventually need to have the > pyarrow.StructArray wrapped in a pyarrow.RecordBatch if that makes any > difference. > > >> > > >> Thanks in advance > > > > --000000000000870f6405be0eeba6 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
This is = a conceptual constraint.=C2=A0 I don't think it is possible to
creat= e a Numpy struct array that would use separate data areas for the
struct= fields.
If the requirement is to convert these to StructA= rray's then this is accurate
=C2=A0
There was a pro= posed "struct" type but it never got implemented, and this use-ca= se seems somewhat niche.=C2=A0 Using an Extension Array on top of=C2=A0 =C2= =A0FixedSizeByteArray seems like the best representation to possibly avoid = copying (not sure if existing python translation libraries would actually m= ake this zero copy).

-Micah

On Sun, Mar 21, 2= 021 at 6:51 AM Antoine Pitrou <ant= oine@python.org> wrote:
On Sun, 21 Mar 2021 12:33:09 +0000
Hagai Har-Gil <hagaihargil@protonmail.com> wrote:
> After some more digging I did arrive at something which seems more eff= icient than what I had:
>
> struct_schema =3D pa.struct([('field0', pa.int32()), ('fie= ld1', pa.int8())])
> nparray =3D x =3D np.array([(0, 10), (1, 20)], dtype=3D[('field0&#= 39;, '<i4'), ('field1', '<i1')])
> struct_array =3D pa.array(nparray, type=3Dstruct_schema)
>
> This looks easy, although I'm not sure how much copying is done do= wn below.

The data is definitely copied under the hood, since this is
converting from an "array of structs" layout (the Numpy array) to= a
"struct of arrays" layout (the Arrow array).

This is a conceptual constraint.=C2=A0 I don't think it is possible to<= br> create a Numpy struct array that would use separate data areas for the
struct fields.

Regards

Antoine.



>
> I now have an issue with the Rust implementation since I'm not sur= e how do I access or iterate over the rows of the resulting StructArray, wh= ich was trivial in Python.
> =E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90 Origin= al Message =E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90<= br> > On Sunday, March 21, 2021 2:22 PM, Hagai Har-Gil <hagaihargil@protonmail.com> wrote:
>
> > After some more digging I did arrive at something which seems mor= e efficient than what I had:
> >
> > struct_schema =3D pa.struct([('field0', pa.int32()), (= 9;field1', pa.int8())])
> > nparray =3D x =3D np.array([(0, 10), (1, 20)], dtype=3D[('fie= ld0', '<i4'), ('field1', '<i1')])
> > struct_array =3D pa.array(nparray, type=3Dstruct_schema)
> >
> > This looks easy, although I'm not sure how much copying is do= ne down below.
> >
> > I now have an issue with the Rust implementation since I'm no= t sure how do I access or iterate over the rows of the resulting StructArra= y.
> >
> > =E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90 O= riginal Message =E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2= =80=90
> > On Sunday, March 21, 2021 10:52 AM, Hagai Har-Gil <
hagaihargil@protonmail= .com> wrote:
> >=C2=A0
> >> Hi,
> >>
> >> I'm trying to efficiently convert incoming numpy.recarray= 's to pyarrow.StructArray and I'm unsure how to do so with the leas= t amount of copying.
> >>
> >> My use case involves real time data processing of numpy.recar= rays in Rust. I'm happily using the IPC protocol to transfer data to Ru= st's arrow implementation which will do the heavy lifting. I'll nee= d to iterate on the recarray-turned-StructArray line-by-line, each time yie= lding all fields of a specific row, so the StructArray format is quite fitt= ing. However, doing the actual conversion in an efficient manner seems hard= er than expected. The fields (=3Dindividual arrays) of a numpy.recarray are= n't stored in a contiguous manner, so any numpy.recarray -> pyarrow.= Array conversion first has to copy the data to standard pyarrow.Array buffe= rs, and then re-construct the StructArray structure by interleaving the arr= ays. I was unable to find in the docs or in previous discussions here a bet= ter approach for this type of pre-processing step.
> >>
> >> Since I'm using IPC I'll eventually need to have the = pyarrow.StructArray wrapped in a pyarrow.RecordBatch if that makes any diff= erence.
> >>
> >> Thanks in advance=C2=A0



--000000000000870f6405be0eeba6--