arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wes McKinney <wesmck...@gmail.com>
Subject Re: Optimising pandas relational ops with pyarrow
Date Fri, 01 Jan 2021 23:47:16 GMT
Note that many of us think it's important to have canonical
implementations of important algorithms (aggregate / hash aggregate,
joins, sorts, etc.) in the Apache project and available to e.g.
pyarrow users, as opposed to having to direct them to a third party
project. I've been unable to do this work myself given my other
responsibilities, but I will be continuing to direct funding /
engineering time from my organization toward these goals. I hope that
others from the community can join in to help out to make the work go
faster.

On Fri, Jan 1, 2021 at 5:36 PM Ivan Petrov <capacytron@gmail.com> wrote:
>
> Hi, thanks for the pointers. We tried cylondata already. We find it hard to build, some
lack of tests for Java, seems like sort and filter not supported yet...
> We are short on time that is why we can’t afford to build own ci/cd for cylondata...
> Project looks very promising and for now it’s a huge technical risk for us.
>
>
> On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon <vibhatha@gmail.com> wrote:
>>
>> Checkout https://cylondata.org/.
>>
>> We have also worked on this problem in both sequential and distributed execution
mode. An early DataFrame API is also available.
>>
>> [1]. https://cylondata.org/docs/python
>> [2]. https://cylondata.org/docs/python_api_docs
>>
>>
>> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger <chris@techascent.com> wrote:
>>>
>>> Ivan,
>>>
>>> The Clojure dataset abstraction does not copy the data, uses mmap, and is generally
extremely fast for aggregate group-by operations. Just FYI.
>>>
>>>
>>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <capacytron@gmail.com> wrote:
>>>>
>>>> Hi!
>>>> I plan to:
>>>> -  join
>>>> - group by
>>>> - filter
>>>> data using pyarrow (new to it). The idea is to get better performance and
memory utilisation ( apache arrow columnar compression) compared to pandas.
>>>> Seems like pyarrow has no support for joining two Tables / Dataset by key
so I have to fallback to pandas.
>>>> I don’t really follow how pyarrow <-> pandas integration works. Will
pandas rely on apache arrow data structure? I’m fine with using only these flat types for
columns to avoid "corner cases"
>>>> - string
>>>> - int
>>>> - long
>>>> - decimal
>>>>
>>>> I have a feeling that pandas will copy all data from apache arrow and double
the size (according to the doc). Did I get it right?
>>>> What is the right way to join, groupBy and filter several "Tables" / "Datasets"
utilizing pyarrow (underlying apache arrow) power?
>>>>
>>>> Thank you!
>>
>> --
>> Vibhatha Abeykoon

Mime
View raw message