arrow-user mailing list archives

From Tom Scheffers <>
Subject Re: [python] Table.filter outputs in memory with no option to direct it to memory map
Date Thu, 25 Mar 2021 17:21:10 GMT
Hi Wes,

Since I hear you are also working on the dataframe api for pyarrow, I would
like to join efforts 😊.

I recently asked a question on the dev mailing list about an alternative to
the datafusion package (in Rust) for Python. Since the datafusion-python
wrapper wasn't performant enough
<> for my use
case, I decided to build my own dataframe engine on top of pyarrow (using
numpy + cython), called wombat_db <>.

The main component is the lazy execution engine, which registers pyarrow
tables and datasets and lets you build a query plan from them. The plan
lets you perform data operations directly on pyarrow tables (joins,
aggregates, filters, order-by, etc.), as well as numerical and logical
operations on column references.
When you call collect(), the plan is optimized (filters are pushed down and
only the necessary columns are read) and then executed.
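For readers unfamiliar with deferred execution, the idea above can be
sketched in plain Python. The names (Plan, select, collect) are illustrative
only, not wombat_db's actual API: operations are merely recorded, and only
applied, with projection and all filters in one pass, when collect() runs.

```python
# Minimal sketch of a lazy query plan: operations are recorded, not executed,
# until collect() runs them. Names are illustrative, not wombat_db's real API.

class Plan:
    def __init__(self, table):
        self._table = table          # dict: column name -> list of values
        self._filters = []           # predicates: row dict -> bool
        self._columns = None         # projection; None means all columns

    def filter(self, predicate):
        self._filters.append(predicate)
        return self                  # chainable, still lazy

    def select(self, columns):
        self._columns = list(columns)
        return self

    def collect(self):
        # "Optimization": emit only the requested columns...
        cols = self._columns or list(self._table)
        # ...and apply every recorded filter in a single pass over the rows.
        n = len(next(iter(self._table.values())))
        rows = ({c: self._table[c][i] for c in self._table} for i in range(n))
        kept = [r for r in rows if all(p(r) for p in self._filters)]
        return {c: [r[c] for r in kept] for c in cols}


plan = Plan({"x": [1, 2, 3, 4], "y": ["a", "b", "c", "d"]})
result = plan.filter(lambda r: r["x"] > 2).select(["y"]).collect()
# result == {"y": ["c", "d"]}
```

Nothing is computed when filter() or select() is called; that is what gives
the engine room to reorder and prune work before execution.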

Although the project is still at an early stage, it is a fast and
convenient way of working with pyarrow. I already use it extensively at my
company for in-memory queries. Thanks to its internal caching system, which
uses subtree matching, it can also serve as a performant query engine for
user requests.
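To make the subtree-matching idea concrete, here is a toy sketch (all names
hypothetical, not wombat_db's implementation): each plan node is a nested
tuple, the tuple itself is the cache key, and two queries that share a
subtree compute it only once.

```python
# Sketch of result caching keyed on plan subtrees: queries that share a
# subtree (here, the same filtered scan) compute it only once.
# All names are illustrative; this is not wombat_db's actual implementation.

TABLES = {"sales": [{"x": 1}, {"x": 3}, {"x": 5}]}
CACHE = {}
EXECUTED = []  # records which node types were actually computed

def execute(node):
    # A plan node is a nested tuple, e.g. ("filter", ("scan", "sales"), 2).
    # Tuples are hashable, so the subtree itself serves as the cache key.
    if node in CACHE:
        return CACHE[node]
    EXECUTED.append(node[0])
    if node[0] == "scan":                 # ("scan", table_name)
        result = TABLES[node[1]]
    elif node[0] == "filter":             # ("filter", child, threshold)
        result = [r for r in execute(node[1]) if r["x"] > node[2]]
    elif node[0] == "project":            # ("project", child, column)
        result = [{node[2]: r[node[2]]} for r in execute(node[1])]
    CACHE[node] = result
    return result

q1 = ("filter", ("scan", "sales"), 2)
q2 = ("project", ("filter", ("scan", "sales"), 2), "x")
execute(q1)
execute(q2)   # the shared ("filter", ...) subtree is served from the cache
```

After both queries run, only three nodes were ever computed; the second
query's filter and scan steps were cache hits.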

Is this the type of interface you have in mind for the project? Are you
planning to implement it in C, or directly in Python?

Kind regards,

Tom Scheffers

On Thu, 25 Mar 2021 at 17:28, Wes McKinney <> wrote:

> This will be new work that we anticipate will be available at some point
> in the future (sooner if others help out!).
> You could do this now by hand by breaking a large table into small chunks,
> filtering them, then writing each chunk into an output file.
> On Thu, Mar 25, 2021 at 12:21 PM Théo Matussière <>
> wrote:
>> Hi Wes, thanks for the quick reply!
>> I'm sorry but I'm not sure I understand what you're referring to with "our
>> query engine work that's currently percolating". Are you referring to
>> ongoing work on Arrow that we can expect to land in the near future, or
>> something that's already available that you're working to leverage in your
>> own use-case?
>> I think the ambiguity for me comes from your example that shows the same
>> API as the one that currently exists, so that it's unclear what actually
>> makes it a query plan.
>> Best,
>> Théo
>> On Thu, Mar 25, 2021 at 4:42 PM Wes McKinney <> wrote:
>>> hi Theo — I think this use case needs to align with our query engine
>>> work that's currently percolating. So rather than eagerly evaluating a
>>> filter, instead we would produce a query plan whose sink is an IPC file or
>>> collection of IPC files.
>>> So from
>>> result = table.filter(boolean_array)
>>> to something like
>>> filter_step = source.filter(filter_expr)
>>> sink_step = write_to_ipc(filter_step, location)
>>> sink_step.execute()
>>> The filtered version of "source" would never be materialized in memory,
>>> so this could run with limited memory footprint
>>> On Thu, Mar 25, 2021 at 11:19 AM Théo Matussière <>
>>> wrote:
>>>> Hi all,
>>>> Thanks for all the cool work on Arrow, it's definitely making things
>>>> easier for us :)
>>>> I'm wondering if there is a workaround for the current behaviour of
>>>> `Table.filter` that I'm seeing, in that its result goes to RAM even if the
>>>> table is memory mapped.
>>>> Here's an example code to highlight the behaviour:
>>>> [image: Screenshot 2021-03-25 at 16.11.31.png]
>>>> Thanks for the attention!
>>>> Théo


Met vriendelijke groeten / Kind regards,

*Tom Scheffers*
+316 233 728 15

Young Bulls

Scheldestraat 21 - 1078 GD Amsterdam
