arrow-user mailing list archives

From Théo Matussière <t...@huggingface.co>
Subject Re: [python] Table.filter outputs in memory with no option to direct it to memory map
Date Thu, 25 Mar 2021 16:20:38 GMT
Hi Wes, thanks for the quick reply!
I'm sorry but I'm not sure I understand what you're referring to with "our
query engine work that's currently percolating". Are you referring to
ongoing work on Arrow that we can expect to land in the near future, or
something that's already available that you're working to leverage in your
own use-case?
I think the ambiguity for me comes from your example that shows the same
API as the one that currently exists, so that it's unclear what actually
makes it a query plan.
Best,
Théo

On Thu, Mar 25, 2021 at 4:42 PM Wes McKinney <wesmckinn@gmail.com> wrote:

> hi Theo — I think this use case needs to align with our query engine work
> that's currently percolating. So rather than eagerly evaluating a filter,
> instead we would produce a query plan whose sink is an IPC file or
> collection of IPC files.
>
> So from
>
> result = table.filter(boolean_array)
>
> to something like
>
> filter_step = source.filter(filter_expr)
> sink_step = write_to_ipc(filter_step, location)
> sink_step.execute()
>
> The filtered version of "source" would never be materialized in memory, so
> this could run with a limited memory footprint.
>
> On Thu, Mar 25, 2021 at 11:19 AM Théo Matussière <theo@huggingface.co>
> wrote:
>
>> Hi all,
>> Thanks for all the cool work on Arrow, it's definitely making things
>> easier for us :)
>>
>> I'm wondering if there is a workaround for the current behaviour of
>> `Table.filter` that I'm seeing, in that its result goes to RAM even if the
>> table is memory mapped.
>>
>> Here's some example code to highlight the behaviour:
>>
>> [image: Screenshot 2021-03-25 at 16.11.31.png]
>>
>> Thanks for the attention!
>> Théo
>>
>
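
To make the eager-versus-plan distinction in the thread concrete, here is a minimal pure-Python sketch of the idea Wes describes: `filter` and the sink constructor only record steps, and nothing is read until `execute()` streams batches through to a file. This is not Arrow's API. `Source`, `FilterStep`, `WriteStep`, and `write_to_sink` are hypothetical names invented for illustration, a "batch" is just a list of integers, and the sink is a plain text file standing in for an IPC file.

```python
import os
import tempfile
from typing import Callable, Iterable, Iterator, List

Batch = List[int]  # hypothetical stand-in for an Arrow record batch


class Source:
    """Wraps an iterable of batches; reading is deferred until iteration."""

    def __init__(self, batches: Iterable[Batch]) -> None:
        self._batches = batches

    def filter(self, predicate: Callable[[int], bool]) -> "FilterStep":
        # Only records the step; no batch is read here.
        return FilterStep(self, predicate)

    def batches(self) -> Iterator[Batch]:
        yield from self._batches


class FilterStep:
    """A plan node that filters each upstream batch when pulled."""

    def __init__(self, upstream: Source, predicate: Callable[[int], bool]) -> None:
        self.upstream = upstream
        self.predicate = predicate

    def batches(self) -> Iterator[Batch]:
        for batch in self.upstream.batches():
            yield [row for row in batch if self.predicate(row)]


class WriteStep:
    """A sink node: execute() pulls one batch at a time and writes it out."""

    def __init__(self, upstream: FilterStep, path: str) -> None:
        self.upstream = upstream
        self.path = path

    def execute(self) -> int:
        rows = 0
        with open(self.path, "w") as sink:
            for batch in self.upstream.batches():
                for row in batch:
                    sink.write(f"{row}\n")
                    rows += 1
        return rows


def write_to_sink(step: FilterStep, path: str) -> WriteStep:
    return WriteStep(step, path)


pulled = []  # records which batches have actually been read


def numbered_batches() -> Iterator[Batch]:
    for start in range(0, 9, 3):
        pulled.append(start)
        yield list(range(start, start + 3))


source = Source(numbered_batches())
filter_step = source.filter(lambda row: row % 2 == 0)
path = os.path.join(tempfile.mkdtemp(), "filtered.txt")
sink_step = write_to_sink(filter_step, path)
pulled_before = list(pulled)  # snapshot taken before execution
rows_written = sink_step.execute()
```

In this sketch only one batch is held in memory at a time during `execute()`, which is the property that would let a filtered result go straight to disk instead of being materialized in RAM.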
