Since I hear you are also working on the dataframe api for pyarrow, I would like to join efforts 😊.
I recently asked a question to the dev mailing list about an alternative to datafusion package (in Rust) for Python. Since the datafusion-python wrapper wasn't performant enough
for my use case, I decided to create my own dataframe engine on top of pyarrow (using numpy + cython) called wombat_db
The main component is the lazy execution engine, which registers pyarrow tables and datasets, from which you can create a query plan. The plan allows you to perform data operations on pyarrow tables directly (joins, aggregates, filters, orderby, etc.), while you can also perform numerical & logical operations on column references directly.
When calling collect(), the plan is optimized (pushing down filters & only reading necessary columns) and executed.
Although the project is still in an early stage, it is a fast and convenient way of working with pyarrow. I already use it extensively in my company for performing in-memory queries. Due to the internal caching system using subtree matching, it can easily be used as a performant query engine serving user requests.
Is this the type of interface what you have in mind for the project? Are you planning on implementing this in C or rather directly in python?
This will be new work that we anticipate will be available at some point in the future (sooner if others help out!).
You could do this now by hand by breaking a large table into small chunks, filtering them, then writing each chunk into an output file.
Hi Wes, thanks for the quick reply!
I'm sorry but I'm not sure I understand what you're referring to with "our query engine work that's currently percolating". Are you referring to ongoing work on Arrow that we can expect to land in the near future, or something that's already available that you're working to leverage in your own use-case?
I think the ambiguity for me comes from your example that shows the same API as the one that currently exists, so that it's unclear what actually makes it a query plan.
hi Theo — I think this use case needs to align with our query engine work that's currently percolating. So rather than eagerly evaluating a filter, instead we would produce a query plan whose sink is an IPC file or collection of IPC files.
result = table.filter(boolean_array)
to something like
filter_step = source.filter(filter_expr)
sink_step = write_to_ipc(filter_step, location)
The filtered version of "source" would never be materialized in memory, so this could run with limited memory footprint
Thanks for all the cool work on Arrow, it's definitely making things easier for us :)
I'm wondering if there is a workaround for the current behaviour of `Table.filter` that I'm seeing, in that its result goes to RAM even if the table is memory mapped.
Here's an example code to highlight the behaviour:
Thanks for the attention!