arrow-user mailing list archives

From Antoine Pitrou <anto...@python.org>
Subject Re: [Python] Efficient numpy.recarray to pyarrow.StructArray conversion
Date Mon, 22 Mar 2021 10:29:54 GMT
On Mon, 22 Mar 2021 06:36:57 +0000
Hagai Har-Gil <hagaihargil@protonmail.com> wrote:
> Hmm, it seems that my mental model was off - I'm indeed interested in an array of structs
> and not in a struct of arrays. After re-reading the (Python) docs I'd argue that they're not
> clear that a StructArray is indeed a SoA, and the behavior of the object with respect to indexing
> further strengthens this notion I had. I might try to put together a docs PR to address this,
> if you think it's worth mentioning.

I don't think it makes sense to mention it specifically in the Python
docs, since it's a characteristic of the Arrow format and applies to
all implementations:
https://arrow.apache.org/docs/format/Columnar.html#struct-layout

Regards

Antoine.



> 
> Thanks,
> Hagai.
> 
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Sunday, March 21, 2021 3:51 PM, Antoine Pitrou <antoine@python.org> wrote:
> 
> > On Sun, 21 Mar 2021 12:33:09 +0000
> > Hagai Har-Gil hagaihargil@protonmail.com wrote:
> >  
> > > After some more digging I did arrive at something which seems more efficient than what I had:
> > > struct_schema = pa.struct([('field0', pa.int32()), ('field1', pa.int8())])
> > > nparray = x = np.array([(0, 10), (1, 20)], dtype=[('field0', '<i4'), ('field1', '<i1')])
> > > struct_array = pa.array(nparray, type=struct_schema)
> > > This looks easy, although I'm not sure how much copying is done down below.
> >
> > The data is definitely copied under the hood, since this is
> > converting from an "array of structs" layout (the Numpy array) to a
> > "struct of arrays" layout (the Arrow array).
> >
> > This is a conceptual constraint. I don't think it is possible to
> > create a Numpy struct array that would use separate data areas for the
> > struct fields.
> >
> > Regards
> >
> > Antoine.
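
The layout difference described in the reply above can be illustrated with the standard library alone. This is only a sketch of the two memory layouts, not how NumPy or Arrow are implemented. The variable names are made up for the illustration, and the row format assumes the packed little-endian (int32, int8) dtype from the example.

```python
import struct

# Two rows with fields (field0: int32, field1: int8), as in the example.
rows = [(0, 10), (1, 20)]

# "Array of structs": one buffer, the fields of each row stored side by side.
# "<ib" is a little-endian int32 followed by an int8, 5 bytes per row.
row_format = struct.Struct("<ib")
aos = b"".join(row_format.pack(*row) for row in rows)

# "Struct of arrays": one contiguous buffer per field.
field0_buf = struct.pack("<%di" % len(rows), *(row[0] for row in rows))
field1_buf = struct.pack("<%db" % len(rows), *(row[1] for row in rows))

# Going from the first layout to the second means visiting every row and
# writing each value into a new per-field buffer, so a copy is unavoidable.
columns = list(zip(*row_format.iter_unpack(aos)))

print(aos.hex())         # interleaved: field0, field1, field0, field1
print(field0_buf.hex())  # all field0 values together
print(field1_buf.hex())  # all field1 values together
```

Neither per-field buffer appears as a contiguous slice of the interleaved buffer, which is why the per-field buffers have to be newly written.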
> >  
> > > I now have an issue with the Rust implementation since I'm not sure how to access
> > > or iterate over the rows of the resulting StructArray, which was trivial in Python.
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Sunday, March 21, 2021 2:22 PM, Hagai Har-Gil hagaihargil@protonmail.com wrote:
> > >  
> > > > After some more digging I did arrive at something which seems more efficient than what I had:
> > > > struct_schema = pa.struct([('field0', pa.int32()), ('field1', pa.int8())])
> > > > nparray = x = np.array([(0, 10), (1, 20)], dtype=[('field0', '<i4'), ('field1', '<i1')])
> > > > struct_array = pa.array(nparray, type=struct_schema)
> > > > This looks easy, although I'm not sure how much copying is done down below.
> > > > I now have an issue with the Rust implementation since I'm not sure how to access
> > > > or iterate over the rows of the resulting StructArray.
> > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > On Sunday, March 21, 2021 10:52 AM, Hagai Har-Gil hagaihargil@protonmail.com wrote:
> > > >  
> > > > > Hi,
> > > > > I'm trying to efficiently convert incoming numpy.recarrays to pyarrow.StructArray
> > > > > and I'm unsure how to do so with the least amount of copying.
> > > > > My use case involves real-time data processing of numpy.recarrays in Rust. I'm
> > > > > happily using the IPC protocol to transfer data to Rust's arrow implementation,
> > > > > which will do the heavy lifting. I'll need to iterate over the
> > > > > recarray-turned-StructArray row by row, each time yielding all fields of a
> > > > > specific row, so the StructArray format is quite fitting.
> > > > > However, doing the actual conversion in an efficient manner seems harder than
> > > > > expected. The fields (=individual arrays) of a numpy.recarray aren't stored in a
> > > > > contiguous manner, so any numpy.recarray -> pyarrow.Array conversion first has to
> > > > > copy the data to standard pyarrow.Array buffers, and then re-construct the
> > > > > StructArray structure by interleaving the arrays. I was unable to find a better
> > > > > approach for this type of pre-processing step in the docs or in previous
> > > > > discussions here.
> > > > > Since I'm using IPC I'll eventually need to have the pyarrow.StructArray wrapped
> > > > > in a pyarrow.RecordBatch, if that makes any difference.
> > > > > Thanks in advance  
> 
> 
> 



