arrow-user mailing list archives

From Ivan Petrov <capacyt...@gmail.com>
Subject Re: Optimising pandas relational ops with pyarrow
Date Sat, 02 Jan 2021 00:11:20 GMT
I can help with Java/Scala, but I have zero experience in C++. That’s another
risk for us: we have several JVM experts and no C++ people. An efficient
distributed join is a mess, by the way ;) We would have to solve OOM problems
and spill through disk. Impala got past this painful stage 8 years ago...

On Sat, 2 Jan 2021 at 00:48, Wes McKinney <wesmckinn@gmail.com> wrote:

> Note that many of us think it's important to have canonical
> implementations of important algorithms (aggregate / hash aggregate,
> joins, sorts, etc.) in the Apache project and available to e.g.
> pyarrow users, as opposed to having to direct them to a third party
> project. I've been unable to do this work myself given my other
> responsibilities, but I will be continuing to direct funding /
> engineering time from my organization toward these goals. I hope that
> others from the community can join in to help out to make the work go
> faster.
>
> On Fri, Jan 1, 2021 at 5:36 PM Ivan Petrov <capacytron@gmail.com> wrote:
> >
> > Hi, thanks for the pointers. We have already tried cylondata. We found it
> > hard to build, it lacks some tests for Java, and sort and filter do not
> > seem to be supported yet...
> > We are short on time, which is why we can’t afford to build our own CI/CD
> > for cylondata...
> > The project looks very promising, but for now it’s a huge technical risk
> > for us.
> >
> >
> > On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon <vibhatha@gmail.com> wrote:
> >>
> >> Check out https://cylondata.org/.
> >>
> >> We have also worked on this problem in both sequential and distributed
> >> execution modes. An early DataFrame API is also available.
> >>
> >> [1]. https://cylondata.org/docs/python
> >> [2]. https://cylondata.org/docs/python_api_docs
> >>
> >>
> >> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger <chris@techascent.com> wrote:
> >>>
> >>> Ivan,
> >>>
> >>> The Clojure dataset abstraction does not copy the data, uses mmap, and
> >>> is generally extremely fast for aggregate group-by operations. Just FYI.
> >>>
> >>>
> >>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <capacytron@gmail.com> wrote:
> >>>>
> >>>> Hi!
> >>>> I plan to:
> >>>> - join
> >>>> - group by
> >>>> - filter
> >>>> data using pyarrow (I am new to it). The idea is to get better performance
> >>>> and memory utilisation (Apache Arrow columnar compression) compared to
> >>>> pandas.
> >>>> It seems pyarrow has no support for joining two Tables / Datasets by
> >>>> key, so I have to fall back to pandas.
> >>>> I don’t really follow how the pyarrow <-> pandas integration works. Will
> >>>> pandas rely on the Apache Arrow data structures? I’m fine with using only
> >>>> these flat types for columns to avoid "corner cases":
> >>>> - string
> >>>> - int
> >>>> - long
> >>>> - decimal
> >>>>
> >>>> I have a feeling that pandas will copy all data from Apache Arrow and
> >>>> double the size (according to the doc). Did I get it right?
> >>>> What is the right way to join, groupBy and filter several "Tables" /
> >>>> "Datasets" utilizing pyarrow (underlying Apache Arrow) power?
> >>>>
> >>>> Thank you!
> >>
> >> --
> >> Vibhatha Abeykoon
>
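
Editorial note: for readers unfamiliar with the operations discussed in this thread (join, filter, hash aggregate), the following is a minimal pure-Python sketch of how they can work over column-oriented data. It is not pyarrow's API or Arrow's implementation. The table contents and the helper names `hash_join`, `filter_rows` and `hash_sum` are hypothetical, chosen only for illustration, and the join assumes the two tables share no column names other than the key.

```python
from collections import defaultdict

# Column-oriented "tables": one Python list per column (illustrative data only).
orders = {
    "customer_id": [1, 2, 1, 3, 2],
    "amount": [10, 25, 5, 40, 15],
}
customers = {
    "customer_id": [1, 2, 3],
    "country": ["DE", "FR", "DE"],
}


def hash_join(left, right, key):
    """Inner join: build a hash table on `right`, then probe it with `left`."""
    build = defaultdict(list)
    for row, k in enumerate(right[key]):
        build[k].append(row)

    # Output columns: all of `left`, plus the non-key columns of `right`.
    out = {name: [] for name in left}
    out.update({name: [] for name in right if name != key})

    for lrow, k in enumerate(left[key]):
        for rrow in build.get(k, []):
            for name in left:
                out[name].append(left[name][lrow])
            for name in right:
                if name != key:
                    out[name].append(right[name][rrow])
    return out


def filter_rows(table, column, predicate):
    """Keep only the rows whose value in `column` satisfies `predicate`."""
    keep = [i for i, v in enumerate(table[column]) if predicate(v)]
    return {name: [values[i] for i in keep] for name, values in table.items()}


def hash_sum(table, key, value):
    """Hash aggregate: sum `value` per distinct `key`."""
    totals = defaultdict(int)
    for k, v in zip(table[key], table[value]):
        totals[k] += v
    return dict(totals)


joined = hash_join(orders, customers, "customer_id")
large = filter_rows(joined, "amount", lambda a: a >= 10)
totals = hash_sum(large, "country", "amount")
print(totals)
```

Both the join and the aggregate keep a hash table in memory, which is why the reply at the top of this thread raises OOM and spilling to disk as the hard part of doing this at scale.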
