arrow-user mailing list archives

From Suresh V <suresh0...@gmail.com>
Subject Re: [Python] Run multiple pc.compute functions on chunks in single pass
Date Wed, 07 Apr 2021 19:36:05 GMT
Thank you very much for the response @wesm. Looking forward to the changes,
and hopefully to gaining enough knowledge to start contributing to the
project. For now I am planning to go the Cython route with a custom
aggregator. To be honest, I am not sure how much we gain from a single pass
versus the potential loss from giving up CPU-friendly vectorization.
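
For reference, here is a rough plain-Python sketch of the custom
aggregator I have in mind (the dataset path and the column names "a",
"b", "c", "x" are placeholders); the inner per-row loop is the part I
would move to Cython:

import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")

# sum, count, min, and max are all mergeable, so a single pass over
# the batches is enough: (a, b, c) -> [sum, count, min, max]
state = {}

for batch in dataset.to_batches(columns=["a", "b", "c", "x"]):
    cols = batch.to_pydict()
    for a, b, c, x in zip(cols["a"], cols["b"], cols["c"], cols["x"]):
        if x is None:
            continue  # skip nulls, matching pandas' default behavior
        s = state.get((a, b, c))
        if s is None:
            state[(a, b, c)] = [x, 1, x, x]
        else:
            s[0] += x
            s[1] += 1
            if x < s[2]:
                s[2] = x
            if x > s[3]:
                s[3] = x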

On Wed, Apr 7, 2021 at 1:53 PM Wes McKinney <wesmckinn@gmail.com> wrote:

> We are working on implementing a streaming aggregation that will be
> exposed in Python, but it probably won't land until the 5.0 release. I am
> not sure this problem can be solved efficiently at 100GB scale with the
> tools pyarrow currently provides.
>
> On Wed, Apr 7, 2021 at 12:41 PM Suresh V <suresh02bt@gmail.com> wrote:
>
>> Hi, I am trying to compute aggregates on a large (100GB) dataset stored
>> in Parquet format. My current approach is to use scan/fragment to load
>> chunks into memory iteratively, and I would like to run the equivalent of
>> the following on each chunk using pc.compute functions:
>>
>> df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])
>>
>> My understanding is that pc.compute has to scan the entire array once per
>> function. Please let me know if that is not the case, and how this can be
>> optimized.
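>>
>> For concreteness, a rough sketch of the per-chunk plan (the path and
>> column names are placeholders): aggregate each chunk with pandas, then
>> re-aggregate the partials, since sum/count/min/max all merge cleanly
>> across chunks.
>>
>> import pandas as pd
>> import pyarrow.dataset as ds
>>
>> dataset = ds.dataset("data/", format="parquet")
>> partials = []
>> for batch in dataset.to_batches(columns=["a", "b", "c", "x"]):
>>     part = batch.to_pandas().groupby(["a", "b", "c"])["x"].agg(
>>         ["sum", "count", "min", "max"])
>>     partials.append(part)
>>
>> # Merge: sum of sums, sum of counts, min of mins, max of maxes.
>> merged = pd.concat(partials).groupby(level=[0, 1, 2]).agg(
>>     {"sum": "sum", "count": "sum", "min": "min", "max": "max"})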
>>
>> Thanks
>>
>
