arrow-user mailing list archives

From Ivan Petrov <capacyt...@gmail.com>
Subject Re: Optimising pandas relational ops with pyarrow
Date Fri, 01 Jan 2021 23:36:29 GMT
Hi, thanks for the pointers. We tried cylondata already. We found it hard to
build, the Java side lacks tests, and sort and filter do not seem to be
supported yet.
We are short on time, which is why we can’t afford to build our own CI/CD for
cylondata.
The project looks very promising, but for now it’s a huge technical risk for us.


On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon <vibhatha@gmail.com> wrote:

> Check out https://cylondata.org/.
>
> We have also worked on this problem in both sequential and distributed
> execution modes. An early DataFrame API is also available.
>
> [1]. https://cylondata.org/docs/python
> [2]. https://cylondata.org/docs/python_api_docs
>
>
> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger <chris@techascent.com>
> wrote:
>
>> Ivan,
>>
>> The Clojure dataset abstraction does not copy the data, uses mmap, and is
>> generally extremely fast for aggregate group-by operations
>> <https://github.com/zero-one-group/geni-performance-benchmark/>. Just
>> FYI.
>>
>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <capacytron@gmail.com> wrote:
>>
>>> Hi!
>>> I plan to:
>>> - join
>>> - group by
>>> - filter
>>> data using pyarrow (I’m new to it). The idea is to get better performance
>>> and memory utilisation (Apache Arrow columnar compression) compared to
>>> pandas.
>>> It seems pyarrow has no support for joining two Tables / Datasets by
>>> key, so I have to fall back to pandas.
>>> I don’t really follow how the pyarrow <-> pandas integration works. Will
>>> pandas rely on the Apache Arrow data structures? I’m fine with using only
>>> these flat types for columns to avoid "corner cases":
>>> - string
>>> - int
>>> - long
>>> - decimal
>>>
>>> I have a feeling that pandas will copy all data from apache arrow and
>>> double the size (according to the doc). Did I get it right?
>>> What is the right way to join, groupBy and filter several "Tables" /
>>> "Datasets" utilizing pyarrow (underlying apache arrow) power?
>>>
>>> Thank you!
>>>
>> --
> Vibhatha Abeykoon
>
