arrow-user mailing list archives

From: Jonathan Yu <jonathan.i...@gmail.com>
Subject: Re: Creating and populating Arrow table directly?
Date: Tue, 13 Oct 2020 20:00:05 GMT
Hey Jacob!

Thanks so much for your response and explanation.

On Mon, Oct 12, 2020 at 9:13 PM Jacob Quinn <quinn.jacobd@gmail.com> wrote:

> I'm not familiar with the internals of the pyarrow implementation, but as
> the primary author of the Arrow.jl Julia implementation, I think I can
> provide a little insight that's probably applicable.
>

Awesome! One of these days, I hope to get around to learning Julia :)

>
> The conceptual problem here is that the arrow format is immutable; arrow
> data is laid out in a fixed memory pattern. For the simplest column types
> (integer, float, etc.), this isn't inherently a problem because it's
> straightforward to allocate the fixed amount of memory, and then you could
> set the memory slots for array elements. For non-"primitive" types,
> however, it quickly becomes nontrivial. String columns, for example, use
> an array-of-arrays memory layout: all the string bytes are laid out end
> to end, and individual column elements contain offsets into that fixed
> memory blob. That can be pretty tricky to 1) pre-allocate and 2) allow
> "setting" values afterward. You would need a way to allocate the exact
> number of bytes for all the strings, then chase the right indirections
> when setting values and generating the correct offsets.
>

This makes sense to me, but it wasn't clear from reading the PyArrow
documentation, so I'll try to contribute something when I have some free time.
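
Out of curiosity, I verified the layout you describe by poking at the
buffers of a small string array. A minimal sketch (assuming pyarrow and
NumPy are installed; with no nulls, the validity bitmap buffer is None):

    import numpy as np
    import pyarrow as pa

    arr = pa.array(["ab", "cde", "f"])
    # For a string array, buffers() returns
    # [validity bitmap, int32 offsets, string data].
    validity, offsets, data = arr.buffers()
    print(data.to_pybytes())                       # b'abcdef' (bytes end to end)
    print(np.frombuffer(offsets, dtype=np.int32))  # [0 2 5 6] (offsets into the blob)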

> So with all that said, my guess is that creating the NumPy arrays with your
> data and getting the data set first, then converting to the arrow format is
> indeed an acceptable workflow. Hopefully that helps.
>

This also makes sense to me given the currently available APIs; for
concreteness, here is a minimal sketch of the NumPy-first workflow I'm
using today (the column name, size, and helper are just for illustration):
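
    import numpy as np
    import pyarrow as pa

    n = 1_000  # number of entries is known up front
    values = np.empty(n, dtype=np.float64)
    for i in range(n):
        values[i] = compute_entry(i)  # hypothetical per-entry computation
    # Convert to Arrow only once all the values are set.
    table = pa.table({"values": values})
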
However, since some of the documentation and blog posts I've read about
Arrow mention inefficiencies in converting to and from the pandas and
NumPy formats (some of the representations, like the bitmask for null
values, differ between them), I wonder whether it might make sense to
provide "builder" classes that can be used to construct arrays value by
value, with a "freeze" method that returns the immutable array. The
contract would be that users may not modify the data in the builder
after invoking that build/freeze method. This is a common pattern in
Java for making it easier to build immutable data structures; a rough
sketch of the kind of API I have in mind follows.
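
All of the class and method names below are hypothetical (nothing like
this exists in pyarrow's public API today), and a real implementation
would append into pre-allocated Arrow buffers rather than a Python list:

    import pyarrow as pa

    class StringArrayBuilder:
        """Hypothetical builder: append values, then freeze to an
        immutable Arrow array."""

        def __init__(self):
            self._values = []

        def append(self, value):
            if self._values is None:
                raise RuntimeError("builder is frozen")
            self._values.append(value)

        def freeze(self):
            # Hand the accumulated values to Arrow and drop the mutable
            # staging list so further appends fail.
            arr = pa.array(self._values, type=pa.string())
            self._values = None
            return arr

    builder = StringArrayBuilder()
    builder.append("hello")
    builder.append("world")
    arr = builder.freeze()  # immutable pyarrow.Array from here on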

I suppose that may be a question for the dev@ list, based on the guidance
in the "Contributing to Arrow" documentation.


>
> -Jacob
>
> On Mon, Oct 12, 2020 at 3:21 PM Jonathan Yu <jonathan.i.yu@gmail.com>
> wrote:
>
>> Hello there,
>>
>> I'm recording an a priori known number of entries per column, and I want
>> to create a Table using these entries. I'm currently using numpy.empty to
>> pre-allocate empty arrays, then creating a Table from them via the
>> pyarrow.table(data={}) factory function.
>>
>> It seems a bit silly to create a bunch of NumPy arrays, only to convert
>> them to Arrow arrays to serialize. Is there any benefit to
>> creating/populating pyarrow.array() objects directly, and if so, how do I
>> do that? Otherwise, is the recommendation to first create a DataFrame in
>> pandas (or a number of NumPy arrays as I'm doing currently), then convert
>> to a Table?
>>
>> I think I want to have a way to create a fixed-size Table consisting of a
>> number of columns, then set the values for each column one by one (similar
>> to iloc/iat in pandas). Is this a sensible thing to try to do?
>>
>> Best,
>>
>> Jonathan
>>
>
