arrow-user mailing list archives

From Théo Matussière <t...@huggingface.co>
Subject Re: [python] Table.filter outputs in memory with no option to direct it to memory map
Date Thu, 25 Mar 2021 16:36:44 GMT
Ah yes, OK, I understand. We might do that indeed, thanks a lot!

On Thu, Mar 25, 2021 at 5:28 PM Wes McKinney <wesmckinn@gmail.com> wrote:

> This will be new work that we anticipate will be available at some point
> in the future (sooner if others help out!).
>
> You could do this now by hand by breaking a large table into small chunks,
> filtering them, then writing each chunk into an output file.
>
> On Thu, Mar 25, 2021 at 12:21 PM Théo Matussière <theo@huggingface.co>
> wrote:
>
>> Hi Wes, thanks for the quick reply!
>> I'm sorry but I'm not sure I understand what you're referring to with "our
>> query engine work that's currently percolating". Are you referring to
>> ongoing work on Arrow that we can expect to land in the near future, or
>> something that's already available that you're working to leverage in your
>> own use-case?
>> I think the ambiguity for me comes from your example that shows the same
>> API as the one that currently exists, so that it's unclear what actually
>> makes it a query plan.
>> Best,
>> Théo
>>
>> On Thu, Mar 25, 2021 at 4:42 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>>> Hi Théo, I think this use case needs to align with our query engine
>>> work that's currently percolating. Rather than eagerly evaluating a
>>> filter, we would produce a query plan whose sink is an IPC file or
>>> collection of IPC files.
>>>
>>> So from
>>>
>>> result = table.filter(boolean_array)
>>>
>>> to something like
>>>
>>> filter_step = source.filter(filter_expr)
>>> sink_step = write_to_ipc(filter_step, location)
>>> sink_step.execute()
>>>
>>> The filtered version of "source" would never be materialized in memory,
>>> so this could run with a limited memory footprint.
>>>
>>> On Thu, Mar 25, 2021 at 11:19 AM Théo Matussière <theo@huggingface.co>
>>> wrote:
>>>
>>>> Hi all,
>>>> Thanks for all the cool work on Arrow, it's definitely making things
>>>> easier for us :)
>>>>
>>>> I'm wondering if there is a workaround for the behaviour of
>>>> `Table.filter` that I'm seeing: its result is materialized in RAM even
>>>> if the table is memory-mapped.
>>>>
>>>> Here's some example code to highlight the behaviour:
>>>>
>>>> [image: Screenshot 2021-03-25 at 16.11.31.png]
>>>>
>>>> Thanks for the attention!
>>>> Théo
>>>>
>>>
