arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe L. Korn" <m...@uwekorn.com>
Subject Re: Creating and populating Arrow table directly?
Date Sun, 18 Oct 2020 19:41:52 GMT
Hello,

You actually can use NumPy arrays to construct an Arrow array without the need to copy any
data. The important aspect here is to treat these NumPy arrays simply as plain memory allocations.
You use it to construct the separate memory memory buffers (i.e. the valid-bits and data buffers)
and then pass it top pyarrow.Array.from_buffers. One example of this can be seen here: https://github.com/xhochy/fletcher/blob/467054f0954c5c5796c6a5ced10891ce0e02e064/fletcher/algorithms/bool.py#L397-L407
<https://github.com/xhochy/fletcher/blob/467054f0954c5c5796c6a5ced10891ce0e02e064/fletcher/algorithms/bool.py#L397-L407>
(context is a library where I construct a lot of pyarrow.Array instances from Python using
Numba). Your underlying memory can be mutable as long as you keep it in the form of the Arrow
memory format and treat it as immutable once you have passed your data structure to a third-party
library/function.

Best
Uwe

> Am 13.10.2020 um 22:00 schrieb Jonathan Yu <jonathan.i.yu@gmail.com>:
> 
> Hey Jacob!
> 
> Thanks so much for your response and explanation.
> 
> On Mon, Oct 12, 2020 at 9:13 PM Jacob Quinn <quinn.jacobd@gmail.com <mailto:quinn.jacobd@gmail.com>>
wrote:
> I'm not familiar with the internals of the pyarrow implementation, but as the primary
author of the Arrow.jl Julia implementation, I think I can provide a little insight that's
probably applicable.
> 
> Awesome! One of these days, I hope to get around to learning Julia :) 
> 
> The conceptual problem here is that the arrow format is immutable; arrow data is laid
out in a fixed memory pattern. For the simplest column types (integer, float, etc.), this
isn't inherently a problem because it's straightforward to allocate the fixed amount of memory,
and then you could set the memory slots for array elements. For non-"primitive" types, however,
it quickly becomes nontrivial. String columns, for example, are an array-of-array memory layout,
where all the string bytes are laid out end to end, then individual column elements contain
offsets into that fixed memory blob. That can be pretty tricky to 1) pre-allocate and 2) allow
"setting" values afterward. You would need a way to allocate the exact # of bytes from all
the strings, then chase the right indirections when setting values and generating the correct
offsets.
> 
> This makes sense to me, but wasn't clear to me reading the documentation for PyArrow,
so I'll try to contribute something when I have some free time.
> 
> So with all that said, my guess is that creating the NumPy arrays with your data and
getting the data set first, then converting to the arrow format is indeed an acceptable workflow.
Hopefully that helps.
> 
> This also makes sense to me given the current available APIs. However, since some of
the documentation/blogs I've read about Arrow make reference to some of the inefficiencies
of converting to and from pandas or NumPy formats, namely that some of the representations
- like the bitmask for null values - are different, I wonder whether it might make sense to
provide "builder" classes that can be used temporarily to construct arrays value by value,
with a "freeze" method that would then return the immutable arrays? The contract would be
that users would not be allowed to modify the data in the builder class after invoking that
build/freeze function. This is at least a common pattern in Java to make it easier to build
immutable data structures.
> 
> I suppose that may be a question for the dev@ list, based on the guidance in "Contributing
to Arrow" documentation.
>  
> 
> -Jacob
> 
> On Mon, Oct 12, 2020 at 3:21 PM Jonathan Yu <jonathan.i.yu@gmail.com <mailto:jonathan.i.yu@gmail.com>>
wrote:
> Hello there,
> 
> I'm recording an a-priori known number of entries per column, and I want to create a
Table using these entries. I'm currently using numpy.empty to pre-allocate empty arrays, then
creating a Table from that via the pyarrow.table(data={}) constructor.
> 
> It seems a bit silly to create a bunch of NumPy arrays, only to convert them to Arrow
arrays to serialize. Is there any benefit to creating/populating pyarrow.array() objects directly,
and if so, how do I do that? Otherwise, is the recommendation to first create a DataFrame
in pandas (or a number of NumPy arrays as I'm doing currently), then convert to a Table?
> 
> I think I want to have a way to create a fixed-size Table consisting of a number of columns,
then set the values for each column one by one (similar to iloc/iat in pandas). Is this a
sensible thing to try to do?
> 
> Best,
> 
> Jonathan
> 


Mime
View raw message