arrow-user mailing list archives

From Weston Pace <weston.p...@gmail.com>
Subject Re: [python] Table.filter outputs in memory with no option to direct it to memory map
Date Thu, 25 Mar 2021 23:56:43 GMT
@Wes I apologize if this has already been addressed, but what do you
envision the query engine's input being?  Would it be a query in some
Arrow query language (e.g. SQL)?  A logical query plan?  Or a physical
query plan? (Admittedly, I'm not too up to speed on the difference between
logical and physical here.)


On Thu, Mar 25, 2021 at 8:20 AM Wes McKinney <wesmckinn@gmail.com> wrote:

> hi Tom — we need everything to be implemented in C++ so the core execution
> logic isn't dependent on Python. Anyone is welcome to help out — we hope to
> have some of the essential query processing scaffolding bootstrapped in the
> next couple of months and it should be easier for individuals to work in
> parallel on different parts of the system.
>
> Thanks
>
> On Thu, Mar 25, 2021 at 1:21 PM Tom Scheffers <tom@youngbulls.nl> wrote:
>
>> Hi Wes,
>>
>> Since I hear you are also working on the dataframe api for pyarrow, I
>> would like to join efforts 😊.
>>
>> I recently asked a question on the dev mailing list about an alternative
>> to the datafusion package (in Rust) for Python. Since the datafusion-python
>> wrapper wasn't performant enough
>> <https://github.com/jorgecarleitao/datafusion-python/issues/4> for my
>> use case, I decided to create my own dataframe engine on top of pyarrow
>> (using numpy + cython), called wombat_db
>> <https://github.com/TomScheffers/wombat>.
>>
>> The main component is a lazy execution engine, which registers pyarrow
>> tables and datasets, from which you can create a query plan. The plan
>> allows you to perform data operations on pyarrow tables directly (joins,
>> aggregates, filters, orderby, etc.), and you can also perform numerical &
>> logical operations on column references.
>> When you call collect(), the plan is optimized (pushing down filters and
>> reading only the necessary columns) and executed.
>>
>> Although the project is still in an early stage, it is a fast and
>> convenient way of working with pyarrow. I already use it extensively in my
>> company for in-memory queries. Thanks to an internal caching system
>> based on subtree matching, it can easily be used as a performant query
>> engine serving user requests.
>>
>> Is this the type of interface that you have in mind for the project? Are
>> you planning to implement this in C or rather directly in Python?
>>
>> Kind regards,
>>
>> Tom Scheffers
>>
>> Op do 25 mrt. 2021 om 17:28 schreef Wes McKinney <wesmckinn@gmail.com>:
>>
>>> This will be new work that we anticipate will be available at some point
>>> in the future (sooner if others help out!).
>>>
>>> You could do this now by hand by breaking a large table into small
>>> chunks, filtering them, then writing each chunk into an output file.
>>>
>>> On Thu, Mar 25, 2021 at 12:21 PM Théo Matussière <theo@huggingface.co>
>>> wrote:
>>>
>>>> Hi Wes, thanks for the quick reply!
>>>> I'm sorry but I'm not sure I understand what you're referring to with "our
>>>> query engine work that's currently percolating". Are you referring to
>>>> ongoing work on Arrow that we can expect to land in the near future, or
>>>> something that's already available that you're working to leverage in your
>>>> own use-case?
>>>> I think the ambiguity for me comes from your example, which shows the
>>>> same API as the one that currently exists, so it's unclear what
>>>> actually makes it a query plan.
>>>> Best,
>>>> Théo
>>>>
>>>> On Thu, Mar 25, 2021 at 4:42 PM Wes McKinney <wesmckinn@gmail.com>
>>>> wrote:
>>>>
>>>>> hi Theo — I think this use case needs to align with our query engine
>>>>> work that's currently percolating. So rather than eagerly evaluating a
>>>>> filter, we would produce a query plan whose sink is an IPC file or a
>>>>> collection of IPC files.
>>>>>
>>>>> So from
>>>>>
>>>>> result = table.filter(boolean_array)
>>>>>
>>>>> to something like
>>>>>
>>>>> filter_step = source.filter(filter_expr)
>>>>> sink_step = write_to_ipc(filter_step, location)
>>>>> sink_step.execute()
>>>>>
>>>>> The filtered version of "source" would never be materialized in
>>>>> memory, so this could run with a limited memory footprint.
>>>>>
>>>>> On Thu, Mar 25, 2021 at 11:19 AM Théo Matussière <theo@huggingface.co>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> Thanks for all the cool work on Arrow, it's definitely making things
>>>>>> easier for us :)
>>>>>>
>>>>>> I'm wondering if there is a workaround for the current behaviour of
>>>>>> `Table.filter` that I'm seeing, in that its result goes to RAM even if
>>>>>> the table is memory mapped.
>>>>>>
>>>>>> Here's an example code to highlight the behaviour:
>>>>>>
>>>>>> [image: Screenshot 2021-03-25 at 16.11.31.png]
>>>>>>
>>>>>> Thanks for the attention!
>>>>>> Théo
>>>>>>
>>>>>
>>
>> --
>>
>> Met vriendelijke groeten / Kind regards,
>>
>> *Tom Scheffers*
>> +316 233 728 15
>>
>>
>> Young Bulls
>>
>> Scheldestraat 21 - 1078 GD Amsterdam
>> <https://www.google.nl/maps/place/Scheldestraat+21,+1078+GD+Amsterdam/@52.3474466,4.889208,17z/data=!3m1!4b1!4m5!3m4!1s0x47c6098acd1783ed:0x649843bbcc11f7cd!8m2!3d52.3474466!4d4.8913967>
>>
>> www.youngbulls.nl
>>
>>
