arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: [Python] Efficient numpy.recarray to pyarrow.StructArray conversion
Date Sun, 21 Mar 2021 17:02:42 GMT
>
> This is a conceptual constraint.  I don't think it is possible to
> create a Numpy struct array that would use separate data areas for the
> struct fields.

If the requirement is to convert these to StructArrays, then this is
accurate.

There was a proposed "struct" type but it never got implemented, and this
use-case seems somewhat niche.  Using an Extension Array on top of
FixedSizeBinary seems like the best representation to possibly avoid
copying (not sure if existing Python translation libraries would actually
make this zero copy).

-Micah

On Sun, Mar 21, 2021 at 6:51 AM Antoine Pitrou <antoine@python.org> wrote:

> On Sun, 21 Mar 2021 12:33:09 +0000
> Hagai Har-Gil <hagaihargil@protonmail.com> wrote:
> > After some more digging I did arrive at something which seems more
> > efficient than what I had:
> >
> > struct_schema = pa.struct([('field0', pa.int32()), ('field1', pa.int8())])
> > nparray = x = np.array([(0, 10), (1, 20)], dtype=[('field0', '<i4'), ('field1', '<i1')])
> > struct_array = pa.array(nparray, type=struct_schema)
> >
> > This looks easy, although I'm not sure how much copying is done down
> > below.
>
> The data is definitely copied under the hood, since this is
> converting from an "array of structs" layout (the Numpy array) to a
> "struct of arrays" layout (the Arrow array).
>
> This is a conceptual constraint.  I don't think it is possible to
> create a Numpy struct array that would use separate data areas for the
> struct fields.
>
> Regards
>
> Antoine.
>
>
>
> >
> > I now have an issue with the Rust implementation since I'm not sure
> > how to access or iterate over the rows of the resulting StructArray,
> > which was trivial in Python.
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > On Sunday, March 21, 2021 2:22 PM, Hagai Har-Gil <hagaihargil@protonmail.com> wrote:
> >
> > > After some more digging I did arrive at something which seems more
> > > efficient than what I had:
> > >
> > > struct_schema = pa.struct([('field0', pa.int32()), ('field1', pa.int8())])
> > > nparray = x = np.array([(0, 10), (1, 20)], dtype=[('field0', '<i4'), ('field1', '<i1')])
> > > struct_array = pa.array(nparray, type=struct_schema)
> > >
> > > This looks easy, although I'm not sure how much copying is done down
> > > below.
> > >
> > > I now have an issue with the Rust implementation since I'm not sure
> > > how to access or iterate over the rows of the resulting StructArray.
> > >
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Sunday, March 21, 2021 10:52 AM, Hagai Har-Gil <hagaihargil@protonmail.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'm trying to efficiently convert incoming numpy.recarrays to
> > >> pyarrow.StructArray and I'm unsure how to do so with the least
> > >> amount of copying.
> > >>
> > >> My use case involves real time data processing of numpy.recarrays
> > >> in Rust. I'm happily using the IPC protocol to transfer data to
> > >> Rust's arrow implementation which will do the heavy lifting. I'll
> > >> need to iterate on the recarray-turned-StructArray line-by-line,
> > >> each time yielding all fields of a specific row, so the StructArray
> > >> format is quite fitting. However, doing the actual conversion in an
> > >> efficient manner seems harder than expected. The fields
> > >> (=individual arrays) of a numpy.recarray aren't stored in a
> > >> contiguous manner, so any numpy.recarray -> pyarrow.Array
> > >> conversion first has to copy the data to standard pyarrow.Array
> > >> buffers, and then re-construct the StructArray structure by
> > >> interleaving the arrays. I was unable to find in the docs or in
> > >> previous discussions here a better approach for this type of
> > >> pre-processing step.
> > >>
> > >> Since I'm using IPC I'll eventually need to have the
> > >> pyarrow.StructArray wrapped in a pyarrow.RecordBatch if that makes
> > >> any difference.
> > >>
> > >> Thanks in advance
>
>
>
>
