arrow-user mailing list archives

From Suresh V <suresh0...@gmail.com>
Subject [Python] Run multiple pc.compute functions on chunks in single pass
Date Wed, 07 Apr 2021 17:41:14 GMT
Hi, I am trying to compute aggregates on large datasets (100 GB) stored in
Parquet format. My current approach is to use scan/fragment to load chunks
iteratively into memory, and I would like to run the equivalent of the following
on each chunk using pyarrow.compute (pc) functions:

df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])

My understanding is that each pc function has to scan the entire array on its
own, so computing sum, count, min and max would mean four passes over every
chunk. Please let me know if that is not the case, and how to optimize it.
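
To make the goal concrete, here is a minimal sketch of the single-pass idea in plain Python rather than pyarrow: each row is visited once, and the running sum, count, min and max for its group are updated together. The value column name "x", the helper update_partials, and the sample chunks are all hypothetical; with pyarrow the chunks would presumably come from the dataset's record batches instead of lists of dicts.

```python
def update_partials(partials, chunk, keys=("a", "b", "c"), value="x"):
    """Fold one chunk (a list of row dicts) into running per-group aggregates.

    `value` is a hypothetical column name; the grouping keys match the
    groupby in the question.
    """
    for row in chunk:
        group = tuple(row[name] for name in keys)
        v = row[value]
        agg = partials.get(group)
        if agg is None:
            # First row seen for this group: initialise all four aggregates.
            partials[group] = {"sum": v, "count": 1, "min": v, "max": v}
        else:
            # Update all four aggregates in the same visit to the row.
            agg["sum"] += v
            agg["count"] += 1
            agg["min"] = min(agg["min"], v)
            agg["max"] = max(agg["max"], v)
    return partials


# Two small made-up chunks standing in for data loaded iteratively.
chunks = [
    [{"a": 1, "b": 1, "c": 1, "x": 10}, {"a": 1, "b": 1, "c": 2, "x": 5}],
    [{"a": 1, "b": 1, "c": 1, "x": 3}, {"a": 1, "b": 1, "c": 2, "x": 7}],
]

partials = {}
for chunk in chunks:
    update_partials(partials, chunk)
```

Because sum, count, min and max can all be merged from partial results, only the per-group state has to stay in memory between chunks, not the chunks themselves.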

Thanks
