arrow-user mailing list archives

From Jacob Quinn <quinn.jac...@gmail.com>
Subject Re: Creating and populating Arrow table directly?
Date Tue, 13 Oct 2020 04:12:59 GMT
I'm not familiar with the internals of the pyarrow implementation, but as
the primary author of the Arrow.jl Julia implementation, I think I can
provide a little insight that's probably applicable.

The conceptual problem here is that the Arrow format is immutable; Arrow
data is laid out in a fixed memory pattern. For the simplest column types
(integer, float, etc.), this isn't inherently a problem because it's
straightforward to allocate the fixed amount of memory and then set the
memory slots for individual array elements. For non-"primitive" types,
however, it quickly becomes nontrivial. String columns, for example, use an
array-of-arrays memory layout, where all the string bytes are laid out end
to end, and the individual column elements are offsets into that fixed
memory blob. That makes it pretty tricky to 1) pre-allocate and 2) allow
"setting" values afterward. You would need a way to allocate the exact
number of bytes for all the strings, then chase the right indirections when
setting values and generating the correct offsets.
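
The string layout described above can be sketched in plain Python. This is
only an illustration, not pyarrow or Arrow.jl code: the helper names
(build_string_column, get_value, set_value) are hypothetical, and the real
format also carries a validity bitmap and fixed-width integer offsets, which
are left out here.

```python
def build_string_column(values):
    """Pack strings into one contiguous byte blob plus an offsets list."""
    data = bytearray()
    offsets = [0]
    for s in values:
        data += s.encode("utf-8")
        offsets.append(len(data))  # end of this string = start of the next
    return offsets, bytes(data)


def get_value(offsets, data, i):
    """Element i is the byte range between two consecutive offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


def set_value(offsets, data, i, new):
    """Replace element i; the blob is rebuilt and later offsets all shift."""
    encoded = new.encode("utf-8")
    delta = len(encoded) - (offsets[i + 1] - offsets[i])
    new_data = data[:offsets[i]] + encoded + data[offsets[i + 1]:]
    new_offsets = offsets[:i + 1] + [o + delta for o in offsets[i + 1:]]
    return new_offsets, new_data


offsets, data = build_string_column(["arrow", "", "julia"])
new_offsets, new_data = set_value(offsets, data, 1, "py")
```

Changing a single element whose length differs from the original touches the
byte blob and every offset after it, which is why in-place "setting" is
awkward for this kind of column.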

So with all that said, my guess is that creating the NumPy arrays, filling
them with your data first, and then converting to the Arrow format is
indeed an acceptable workflow. Hopefully that helps.

-Jacob

On Mon, Oct 12, 2020 at 3:21 PM Jonathan Yu <jonathan.i.yu@gmail.com> wrote:

> Hello there,
>
> I'm recording an a-priori known number of entries per column, and I want
> to create a Table using these entries. I'm currently using numpy.empty to
> pre-allocate empty arrays, then creating a Table from that via the
> pyarrow.table(data={}) constructor.
>
> It seems a bit silly to create a bunch of NumPy arrays, only to convert
> them to Arrow arrays to serialize. Is there any benefit to
> creating/populating pyarrow.array() objects directly, and if so, how do I
> do that? Otherwise, is the recommendation to first create a DataFrame in
> pandas (or a number of NumPy arrays as I'm doing currently), then convert
> to a Table?
>
> I think I want to have a way to create a fixed-size Table consisting of a
> number of columns, then set the values for each column one by one (similar
> to iloc/iat in pandas). Is this a sensible thing to try to do?
>
> Best,
>
> Jonathan
>
