arrow-user mailing list archives

From Hagai Har-Gil
Subject [Python] Efficient numpy.recarray to pyarrow.StructArray conversion
Date Sun, 21 Mar 2021 08:52:07 GMT

I'm trying to efficiently convert incoming numpy.recarrays to pyarrow.StructArray, and I'm
unsure how to do so with the least amount of copying.

My use case involves real-time processing of numpy.recarrays in Rust. I'm happily using
the IPC protocol to transfer data to Rust's arrow implementation, which will do the heavy lifting.
I'll need to iterate over the recarray-turned-StructArray row by row, each time yielding all
fields of a specific row, so the StructArray format is quite fitting. However, doing the actual
conversion efficiently seems harder than expected. The fields (= individual arrays)
of a numpy.recarray aren't stored contiguously: the records are interleaved in memory, so each
field is only a strided view. Any numpy.recarray -> pyarrow.Array conversion therefore first has
to copy each field into a standard pyarrow.Array buffer, and then re-construct the StructArray
from those arrays. I was unable to find a better approach for this kind of pre-processing step
in the docs or in previous discussions here.

Since I'm using IPC, I'll eventually need to wrap the pyarrow.StructArray in a pyarrow.RecordBatch,
if that makes any difference.

Thanks in advance.