arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: [python] Table.filter outputs in memory with no option to direct it to memory map
Date Thu, 25 Mar 2021 18:19:50 GMT
hi Tom — we need everything to be implemented in C++ so the core execution
logic isn't dependent on Python. Anyone is welcome to help out — we hope to
have some of the essential query processing scaffolding bootstrapped in the
next couple of months and it should be easier for individuals to work in
parallel on different parts of the system.

Thanks

On Thu, Mar 25, 2021 at 1:21 PM Tom Scheffers <tom@youngbulls.nl> wrote:

> Hi Wes,
>
> Since I hear you are also working on the dataframe api for pyarrow, I
> would like to join efforts 😊.
>
> I recently asked a question on the dev mailing list about an alternative
> to the datafusion package (in Rust) for Python. Since the datafusion-python
> wrapper wasn't performant enough
> <https://github.com/jorgecarleitao/datafusion-python/issues/4> for my use
> case, I decided to create my own dataframe engine on top of pyarrow (using
> numpy + cython) called wombat_db <https://github.com/TomScheffers/wombat>.
>
> The main component is the lazy execution engine, which registers pyarrow
> tables and datasets, from which you can create a query plan. The plan
> allows you to perform data operations on pyarrow tables directly (joins,
> aggregates, filters, orderby, etc.), while you can also perform numerical &
> logical operations on column references directly.
> When calling collect(), the plan is optimized (pushing down filters & only
> reading necessary columns) and executed.
>
> Although the project is still in an early stage, it is a fast and
> convenient way of working with pyarrow. I already use it extensively in my
> company for performing in-memory queries. Due to the internal caching
> system using subtree matching, it can easily be used as a performant query
> engine serving user requests.
>
> Is this the type of interface you have in mind for the project? Are
> you planning on implementing this in C or rather directly in Python?
>
> Kind regards,
>
> Tom Scheffers
>
> On Thu, Mar 25, 2021 at 5:28 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>
>> This will be new work that we anticipate will be available at some point
>> in the future (sooner if others help out!).
>>
>> You could do this now by hand by breaking a large table into small
>> chunks, filtering them, then writing each chunk into an output file.
>>
>> On Thu, Mar 25, 2021 at 12:21 PM Théo Matussière <theo@huggingface.co>
>> wrote:
>>
>>> Hi Wes, thanks for the quick reply!
>>> I'm sorry but I'm not sure I understand what you're referring to with "our
>>> query engine work that's currently percolating". Are you referring to
>>> ongoing work on Arrow that we can expect to land in the near future, or
>>> something that's already available that you're working to leverage in your
>>> own use-case?
>>> I think the ambiguity for me comes from your example that shows the same
>>> API as the one that currently exists, so that it's unclear what actually
>>> makes it a query plan.
>>> Best,
>>> Théo
>>>
>>> On Thu, Mar 25, 2021 at 4:42 PM Wes McKinney <wesmckinn@gmail.com>
>>> wrote:
>>>
>>>> hi Theo — I think this use case needs to align with our query engine
>>>> work that's currently percolating. So rather than eagerly evaluating a
>>>> filter, instead we would produce a query plan whose sink is an IPC file or
>>>> collection of IPC files.
>>>>
>>>> So from
>>>>
>>>> result = table.filter(boolean_array)
>>>>
>>>> to something like
>>>>
>>>> filter_step = source.filter(filter_expr)
>>>> sink_step = write_to_ipc(filter_step, location)
>>>> sink_step.execute()
>>>>
>>>> The filtered version of "source" would never be materialized in memory,
>>>> so this could run with a limited memory footprint.
>>>>
>>>> On Thu, Mar 25, 2021 at 11:19 AM Théo Matussière <theo@huggingface.co>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>> Thanks for all the cool work on Arrow, it's definitely making things
>>>>> easier for us :)
>>>>>
>>>>> I'm wondering if there is a workaround for the current behaviour of
>>>>> `Table.filter` that I'm seeing, in that its result goes to RAM even if the
>>>>> table is memory mapped.
>>>>>
>>>>> Here's an example code to highlight the behaviour:
>>>>>
>>>>> [image: Screenshot 2021-03-25 at 16.11.31.png]
>>>>>
>>>>> Thanks for the attention!
>>>>> Théo
>>>>>
>>>>
>
> --
>
> Met vriendelijke groeten / Kind regards,
>
> *Tom Scheffers*
> +316 233 728 15
>
>
> Young Bulls
>
> Scheldestraat 21 - 1078 GD Amsterdam
> <https://www.google.nl/maps/place/Scheldestraat+21,+1078+GD+Amsterdam/@52.3474466,4.889208,17z/data=!3m1!4b1!4m5!3m4!1s0x47c6098acd1783ed:0x649843bbcc11f7cd!8m2!3d52.3474466!4d4.8913967>
>
> www.youngbulls.nl
>
>
