arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tom Scheffers <...@youngbulls.nl>
Subject Re: [python] Table.filter outputs in memory with no option to direct it to memory map
Date Thu, 25 Mar 2021 17:21:10 GMT
Hi Wes,

Since I hear you are also working on the dataframe api for pyarrow, I would
like to join efforts 😊.

I recently asked a question to the dev mailing list about an alternative to
datafusion package (in Rust) for Python. Since the datafusion-python
wrapper wasn't performant enough
<https://github.com/jorgecarleitao/datafusion-python/issues/4> for my use
case, I decided to create my own dataframe engine on top of pyarrow (using
numpy + cython) called wombat_db <https://github.com/TomScheffers/wombat>.

The main component is the lazy execution engine, which registers pyarrow
tables and datasets, from which you can create a query plan. The plan
allows you to perform data operations on pyarrow tables directly (joins,
aggregates, filters, orderby, etc.), while you can also perform numerical &
logical operations on column references directly.
When calling collect(), the plan is optimized (pushing down filters & only
reading necessary columns) and executed.

Although the project is still in an early stage, it is a fast and
convenient way of working with pyarrow. I already use it extensively in my
company for performing in-memory queries. Due to the internal caching
system using subtree matching, it can easily be used as a performant query
engine serving user requests.

Is this the type of interface what you have in mind for the project? Are
you planning on implementing this in C or rather directly in python?

Kind regards,

Tom Scheffers

Op do 25 mrt. 2021 om 17:28 schreef Wes McKinney <wesmckinn@gmail.com>:

> This will be new work that we anticipate will be available at some point
> in the future (sooner if others help out!).
>
> You could do this now by hand by breaking a large table into small chunks,
> filtering them, then writing each chunk into an output file.
>
> On Thu, Mar 25, 2021 at 12:21 PM Théo Matussière <theo@huggingface.co>
> wrote:
>
>> Hi Wes, thanks for the quick reply!
>> I'm sorry but I'm not sure I understand what you're referring to with "our
>> query engine work that's currently percolating". Are you referring to
>> ongoing work on Arrow that we can expect to land in the near future, or
>> something that's already available that you're working to leverage in your
>> own use-case?
>> I think the ambiguity for me comes from your example that shows the same
>> API as the one that currently exists, so that it's unclear what actually
>> makes it a query plan.
>> Best,
>> Théo
>>
>> On Thu, Mar 25, 2021 at 4:42 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>>> hi Theo — I think this use case needs to align with our query engine
>>> work that's currently percolating. So rather than eagerly evaluating a
>>> filter, instead we would produce a query plan whose sink is an IPC file or
>>> collection of IPC files.
>>>
>>> So from
>>>
>>> result = table.filter(boolean_array)
>>>
>>> to something like
>>>
>>> filter_step = source.filter(filter_expr)
>>> sink_step = write_to_ipc(filter_step, location)
>>> sink_step.execute()
>>>
>>> The filtered version of "source" would never be materialized in memory,
>>> so this could run with limited memory footprint
>>>
>>> On Thu, Mar 25, 2021 at 11:19 AM Théo Matussière <theo@huggingface.co>
>>> wrote:
>>>
>>>> Hi all,
>>>> Thanks for all the cool work on Arrow, it's definitely making things
>>>> easier for us :)
>>>>
>>>> I'm wondering if there is a workaround for the current behaviour of
>>>> `Table.filter` that I'm seeing, in that its result goes to RAM even if the
>>>> table is memory mapped.
>>>>
>>>> Here's an example code to highlight the behaviour:
>>>>
>>>> [image: Screenshot 2021-03-25 at 16.11.31.png]
>>>>
>>>> Thanks for the attention!
>>>> Théo
>>>>
>>>

-- 

Met vriendelijke groeten / Kind regards,

*Tom Scheffers*
+316 233 728 15


Young Bulls

Scheldestraat 21 - 1078 GD Amsterdam
<https://www.google.nl/maps/place/Scheldestraat+21,+1078+GD+Amsterdam/@52.3474466,4.889208,17z/data=!3m1!4b1!4m5!3m4!1s0x47c6098acd1783ed:0x649843bbcc11f7cd!8m2!3d52.3474466!4d4.8913967>

www.youngbulls.nl

Mime
View raw message