arrow-user mailing list archives

From: Joris Van den Bossche <jorisvandenboss...@gmail.com>
Subject: Re: Predicate pushdown clarification
Date: Tue, 17 Nov 2020 06:57:06 GMT
Hi Andrew,

Thanks for your questions! Some inline answers:

On Sun, 15 Nov 2020 at 23:57, Andrew Campbell <andrewjcampbell1@gmail.com>
wrote:

> Hi Arrow community,
>
> I'm new to the project and am trying to understand exactly what is
> happening under the hood when I run a filter-collect query on an Arrow
> Dataset (backed by Parquet).
>
> Let's say I created a Parquet dataset with no file-level partitions. I
> just wrote a bunch of separate files to a dataset. Now I want to run a
> query that returns the rows corresponding to a specific range of datetimes
> in the dataset's dt column.
>
> My understanding is that the Dataset API will push this query down to the
> file level, checking the footer of each file for the min/max value of dt
> and determining whether this block of rows should be read.
>
Indeed, that understanding is correct.
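
To make that concrete, a minimal sketch of such a filter-collect query with
the Python Dataset API might look like the following (the path is a
placeholder; "dt" is the datetime column from your example):

    import datetime
    import pyarrow.dataset as ds

    # "path/to/dataset_dir" is a placeholder for the directory of Parquet files.
    dataset = ds.dataset("path/to/dataset_dir", format="parquet")

    # The filter on "dt" is pushed down to the file level: the min/max
    # statistics in each file's footer are used to decide whether that
    # file needs to be read at all.
    table = dataset.to_table(
        filter=(ds.field("dt") >= datetime.datetime(2020, 1, 1))
        & (ds.field("dt") < datetime.datetime(2020, 2, 1))
    )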


> Assuming this is correct, a few questions:
>
> Will every query result in reading all of the file footers? Is there
> any caching of these min/max values?
>
If you are using the same dataset object to run multiple queries, then the
FileMetadata read from the file footers is indeed cached after it has been
read the first time.
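
For example, constructing the dataset object once and reusing it for two
queries (path again a placeholder) could look like this:

    import datetime
    import pyarrow.dataset as ds

    # Construct the dataset object once and keep it around.
    dataset = ds.dataset("path/to/dataset_dir", format="parquet")

    # First query: the file footers are read to evaluate the filter
    # against the Parquet min/max statistics.
    t1 = dataset.to_table(filter=ds.field("dt") >= datetime.datetime(2020, 6, 1))

    # Second query on the same dataset object: the FileMetadata cached
    # during the first scan can be reused instead of re-read from disk.
    t2 = dataset.to_table(filter=ds.field("dt") < datetime.datetime(2020, 3, 1))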


> Is there a way to profile query performance? A way to view a query plan
> before it is executed?
>
No, at least not yet. I suppose once there is more work on general query
execution (and not only reading/filtering), more tooling will grow up around
it. But for now, you will have to make do with general performance profiling
tools (for Python I can recommend py-spy, which has a mode that also profiles
native code, not only Python calls).
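
For example, recording a flame graph of a query script (script name is
hypothetical) with native-code sampling enabled:

    py-spy record --native -o profile.svg -- python my_query_script.py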

Best,
Joris


> I appreciate your time in helping me better understand.
>
> Andrew
>
