arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Are Arrow, Flight and Plasma suitable for my use case?
Date Fri, 19 Mar 2021 17:40:30 GMT
The topic of putting tensors in an Arrow record batch column has come
up many times; the problem is only waiting for a champion to propose a
solution and implement it (particularly on the C++ side; it would be
pretty straightforward to implement this as an extension type on top
of binary arrays). If someone would like to fund this work, feel free
to get in touch with me offline.
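
To make the "extension type on top of binary arrays" idea concrete, here
is a minimal, library-free sketch. It is not the Arrow API and not an
existing implementation: it only assumes that every tensor in a column
shares one shape and dtype (kept as type-level metadata) and that each row
is stored as the row-major bytes of one tensor. The names
TensorColumnType, pack_tensor and unpack_tensor are hypothetical.

```python
import struct
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class TensorColumnType:
    """Hypothetical type-level metadata: one shape and dtype per column."""
    shape: tuple
    fmt: str = "f"  # struct format code for float32

    @property
    def byte_width(self):
        # Bytes occupied by one tensor in the binary storage column.
        return prod(self.shape) * struct.calcsize("<" + self.fmt)


def pack_tensor(typ, rows):
    """Flatten a 2D nested list (row-major) into one little-endian blob."""
    flat = [value for row in rows for value in row]
    if len(flat) != prod(typ.shape):
        raise ValueError("tensor does not match the column shape")
    return struct.pack(f"<{len(flat)}{typ.fmt}", *flat)


def unpack_tensor(typ, blob):
    """Rebuild the 2D nested list from one binary value."""
    n_rows, n_cols = typ.shape
    flat = struct.unpack(f"<{n_rows * n_cols}{typ.fmt}", blob)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


# A tiny "column" of two 2x3 tensors, stored as plain binary values.
small = TensorColumnType(shape=(2, 3))
column = [
    pack_tensor(small, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    pack_tensor(small, [[0.5, 0.25, 0.0], [-1.0, -2.0, -3.0]]),
]
restored = unpack_tensor(small, column[1])

# Size check for the 4000x4000 float32 case from the original question.
big = TensorColumnType(shape=(4000, 4000))
bytes_per_tensor = big.byte_width
# Arrow's plain binary type uses 32-bit offsets, so the data buffer of a
# single binary array stays below 2 GiB.
tensors_per_binary_array = (2**31 - 1) // bytes_per_tensor
```

The size check suggests that, with tensors this large, a single plain
binary array would only hold a few dozen rows, so a real implementation
would presumably rely on chunking or on the 64-bit-offset large binary
type; that part is an assumption on top of the sketch, not something
settled in this thread.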

On Fri, Mar 19, 2021 at 3:20 AM Fernando Herrera
<fernando.j.herrera@gmail.com> wrote:
>
> Hi Matias,
>
> If you are going to do tensor operations, then you could use the Arrow tensor
> representation.
>
> https://arrow.apache.org/docs/python/generated/pyarrow.Tensor.html
>
> However, I don't think the data stored in the tensor will be compressed. It will be
> stored in an orderly layout, so you can share the tensors with other processes.
>
> I hope that helps
> Fernando
>
> On Fri, Mar 19, 2021 at 8:52 AM Matias Guijarro <matias.guijarro@free.fr> wrote:
>>
>> Hi!
>>
>> I recently learned about Apache Arrow, and as a preliminary study I would
>> like to know whether it is a good choice for my use case, or whether I have
>> to look for another technology (or craft something specific on my own!).
>>
>> I could not really find answers to my questions in the FAQ or reading
>> articles and blogs, but I may have missed some information so I apologize
>> in advance if my questions have already been answered.
>>
>> Arrow is all about storing columnar data. What can the elements of a
>> column contain?
>>
>> In my case, I have scalar values (numbers), 1D arrays and 2D arrays.
>> The 2D arrays can be quite big (4000x4000 float32, for example).
>> So we could imagine long tables, hundreds of thousands of rows, containing
>> a mix of those data types.
>>
>> I wonder whether Arrow stays efficient for this kind of data? In particular,
>> rows of 2D arrays in a column may be difficult to handle with the
>> same level of optimization (just guessing).
>>
>> Is there some compression in Arrow? I am thinking of Blosc-style
>> compression (as in the dead "bcolz" project; by the way, someone already
>> wondered about Arrow + Blosc: https://github.com/Blosc/bcolz/issues/300).
>>
>> Another use case I have is for multiple processes on the same
>> computer to access the Arrow in-memory store; it seems to me that Plasma
>> does this job, but I wonder about the trade-offs?
>>
>> Thanks in advance for your advice. Any help would be highly appreciated!
>>
>> Cheers,
>> Matias.