arrow-user mailing list archives

From Vibhatha Abeykoon <vibha...@gmail.com>
Subject Re: Optimising pandas relational ops with pyarrow
Date Fri, 01 Jan 2021 23:25:04 GMT
Check out https://cylondata.org/.

We have also worked on this problem, in both sequential and distributed
execution modes. An early DataFrame API is also available; a rough usage
sketch follows the links below.

[1]. https://cylondata.org/docs/python
[2]. https://cylondata.org/docs/python_api_docs
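
A rough sketch of what a local join looks like with that DataFrame API
(the constructor and merge signature below are from memory and only
illustrative; [2] is the authoritative reference):

import random
from pycylon import DataFrame

# Two small in-memory frames; passing columns as lists is an assumed layout.
df1 = DataFrame([random.sample(range(10, 100), 50),
                 random.sample(range(10, 100), 50)])
df2 = DataFrame([random.sample(range(10, 100), 50),
                 random.sample(range(10, 100), 50)])

# Sequential (single-process) join on the first column.
df3 = df1.merge(right=df2, on=[0])
print(df3)

Distributed execution uses the same operators with a communicator/environment
argument; [1] covers the exact setup.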


On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger <chris@techascent.com>
wrote:

> Ivan,
>
> The Clojure dataset abstraction does not copy the data, uses mmap, and is
> generally extremely fast for aggregate group-by operations
> <https://github.com/zero-one-group/geni-performance-benchmark/>. Just FYI.
>
> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <capacytron@gmail.com> wrote:
>
>> Hi!
>> I plan to:
>> - join
>> - group by
>> - filter
>> data using pyarrow (new to it). The idea is to get better performance and
>> memory utilisation (Apache Arrow columnar compression) compared to pandas.
>> It seems pyarrow has no support for joining two Tables / Datasets by key,
>> so I have to fall back to pandas.
>> I don’t really follow how the pyarrow <-> pandas integration works. Will
>> pandas rely on the Apache Arrow data structures? I’m fine with using only
>> these flat types for columns, to avoid "corner cases":
>> - string
>> - int
>> - long
>> - decimal
>>
>> I have a feeling that pandas will copy all the data from Apache Arrow and
>> double the size in memory (according to the docs). Did I get that right?
>> What is the right way to join, group by, and filter several "Tables" /
>> "Datasets" while utilizing the power of pyarrow (and the underlying Apache
>> Arrow)?
>>
>> Thank you!
>>
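On the copy question above: for the column types you listed, Table.to_pandas()
generally does copy — string and decimal columns are materialized as Python
objects, and numeric columns avoid the copy only with split_blocks=True and no
nulls — and pa.Table.from_pandas() copies again on the way back. Filtering can
be done on the Arrow side first with pyarrow.compute, while joins and group-bys
still have to go through pandas in current (2.x) releases. A minimal sketch,
with made-up column names:

import pyarrow as pa
import pyarrow.compute as pc

# Toy tables for illustration; "key", "a", "b" are made-up columns.
left = pa.table({"key": [1, 2, 3, 4], "a": ["x", "y", "z", "w"]})
right = pa.table({"key": [2, 3, 5], "b": [20, 30, 50]})

# Filtering stays on the Arrow side, before any pandas conversion.
left = left.filter(pc.greater(left["key"], 1))

# Joins/group-bys are not in pyarrow itself yet, so convert to pandas.
# split_blocks/self_destruct keep peak memory closer to one copy than two.
ldf = left.to_pandas(split_blocks=True, self_destruct=True)
rdf = right.to_pandas(split_blocks=True, self_destruct=True)

joined = ldf.merge(rdf, on="key", how="inner")
result = joined.groupby("key", as_index=False)["b"].sum()

# Going back to Arrow is another copy.
print(pa.Table.from_pandas(result))
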
--
Vibhatha Abeykoon
