arrow-user mailing list archives

From Hagai Har-Gil <hagaihar...@protonmail.com>
Subject Re: [Python] Efficient numpy.recarray to pyarrow.StructArray conversion
Date Sun, 21 Mar 2021 12:33:09 GMT
After some more digging I did arrive at something which seems more efficient than what I had:

import numpy as np
import pyarrow as pa
struct_schema = pa.struct([('field0', pa.int32()), ('field1', pa.int8())])
nparray = np.array([(0, 10), (1, 20)], dtype=[('field0', '<i4'), ('field1', '<i1')])
struct_array = pa.array(nparray, type=struct_schema)

This looks easy, although I'm not sure how much copying happens under the hood.

I now have an issue with the Rust implementation, since I'm not sure how to access or iterate
over the rows of the resulting StructArray, which was trivial in Python.
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Sunday, March 21, 2021 10:52 AM, Hagai Har-Gil <hagaihargil@protonmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to efficiently convert incoming numpy.recarrays to pyarrow.StructArray,
>> and I'm unsure how to do so with the least amount of copying.
>>
>> My use case involves real-time data processing of numpy.recarrays in Rust. I'm happily
>> using the IPC protocol to transfer data to Rust's arrow implementation, which will do the
>> heavy lifting. I'll need to iterate over the recarray-turned-StructArray row by row, each
>> time yielding all fields of a specific row, so the StructArray format is quite fitting.
>> However, doing the actual conversion in an efficient manner seems harder than expected.
>> The fields (=individual arrays) of a numpy.recarray aren't stored in a contiguous manner,
>> so any numpy.recarray -> pyarrow.Array conversion first has to copy the data to standard
>> pyarrow.Array buffers, and then re-construct the StructArray structure by interleaving
>> the arrays. I was unable to find a better approach for this type of pre-processing step
>> in the docs or in previous discussions here.
>>
>> Since I'm using IPC I'll eventually need to have the pyarrow.StructArray wrapped
>> in a pyarrow.RecordBatch if that makes any difference.
>>
>> Thanks in advance.
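
The non-contiguous field layout described in the original question can be checked directly in NumPy. A minimal sketch, assuming the same unaligned two-field dtype used earlier in the thread; the variable names are illustrative only.

```python
import numpy as np

nparray = np.array([(0, 10), (1, 20)], dtype=[('field0', '<i4'), ('field1', '<i1')])

# Selecting a field returns a view into the packed record buffer, not a copy.
field0 = nparray['field0']

# Records are packed back to back (4 + 1 bytes each), so consecutive
# field0 values are 5 bytes apart rather than 4.
record_size = nparray.dtype.itemsize
field0_stride = field0.strides[0]

# Because the stride differs from the item size, the view is not contiguous.
is_contiguous = field0.flags['C_CONTIGUOUS']
```

Arrow, by contrast, keeps each child of a StructArray in its own contiguous buffer, which is why at least one copy per field seems hard to avoid when starting from this layout.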