arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: [Python] Run multiple pc.compute functions on chunks in single pass
Date Wed, 07 Apr 2021 17:53:19 GMT
We are working on implementing streaming aggregation for Python, but it
probably won’t land until the 5.0 release. I am not sure this problem can be
solved efficiently at 100 GB scale with the tools currently available in
pyarrow.

On Wed, Apr 7, 2021 at 12:41 PM Suresh V <suresh02bt@gmail.com> wrote:

> Hi, I am trying to compute aggregates on large datasets (~100 GB) stored
> in Parquet format. My current approach is to use scan/fragment to load chunks
> iteratively into memory, and I would like to run the equivalent of the
> following on each chunk using pc.compute functions:
>
> df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])
>
> My understanding is that pc.compute needs to scan the entire array once for
> each of these functions. Please let me know if that is not the case, and how
> this can be optimized.
>
> Thanks
>
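A minimal sketch of the chunk-and-combine pattern described in the question,
assuming a hypothetical dataset path and a hypothetical value column x (the
group keys a, b, c come from the question). It falls back to pandas for the
per-chunk groupby, since pyarrow's compute kernels at the time offered no
grouped aggregation; pyarrow.dataset is used only to stream one fragment at a
time:

import pyarrow.dataset as ds
import pandas as pd

# Hypothetical path; one fragment is roughly one Parquet file.
dataset = ds.dataset("path/to/parquet/dir", format="parquet")

partials = []
for fragment in dataset.get_fragments():
    # Materialize only one fragment at a time to bound memory use,
    # and only the columns the aggregation actually needs.
    chunk = fragment.to_table(columns=["a", "b", "c", "x"]).to_pandas()
    partials.append(
        chunk.groupby(["a", "b", "c"])["x"].agg(["sum", "count", "min", "max"])
    )

# These four aggregates merge exactly across chunks:
# sum of sums, sum of counts, min of mins, max of maxes.
result = (
    pd.concat(partials)
    .groupby(level=["a", "b", "c"])
    .agg({"sum": "sum", "count": "sum", "min": "min", "max": "max"})
)

This only works for aggregates that combine associatively (a mean, for
instance, would have to be derived afterwards as sum/count rather than
averaged per chunk).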
