Hi Theo, I think this use case needs to align with our query engine work that's currently percolating. Rather than eagerly evaluating a filter, we would produce a query plan whose sink is an IPC file or a collection of IPC files.

So from

result = table.filter(boolean_array)

to something like

filter_step = source.filter(filter_expr)
sink_step = write_to_ipc(filter_step, location)

The filtered version of "source" would never be materialized in memory, so this could run with a limited memory footprint.

On Thu, Mar 25, 2021 at 11:19 AM Théo Matussière <theo@huggingface.co> wrote:
Hi all,
Thanks for all the cool work on Arrow, it's definitely making things easier for us :)

I'm wondering if there is a workaround for the current behaviour of `Table.filter` that I'm seeing, in that its result goes to RAM even if the table is memory mapped.

Here's some example code to highlight the behaviour:

[Attachment: Screenshot 2021-03-25 at 16.11.31.png]

Thanks for the attention!