arrow-user mailing list archives

From Ivan Petrov <capacyt...@gmail.com>
Subject Optimising pandas relational ops with pyarrow
Date Fri, 01 Jan 2021 17:23:52 GMT
Hi!
I plan to:
- join
- group by
- filter
data using pyarrow (I'm new to it). The idea is to get better performance and
memory utilisation (Apache Arrow columnar compression) compared to pandas.
It seems pyarrow has no support for joining two Tables / Datasets by key,
so I have to fall back to pandas.
I don’t really follow how the pyarrow <-> pandas integration works. Will pandas
rely on the Apache Arrow data structures? I’m fine with using only these flat
column types to avoid "corner cases":
- string
- int
- long
- decimal

I have a feeling that pandas will copy all the data from Apache Arrow and
double the memory footprint (according to the docs). Did I get that right?
What is the right way to join, group by and filter several "Tables" /
"Datasets" while making use of pyarrow (and the underlying Apache Arrow)?

Thank you!
