Hi Andrew,

Thanks for your questions! Some inline answers:

On Sun, 15 Nov 2020 at 23:57, Andrew Campbell <andrewjcampbell1@gmail.com> wrote:

Hi Arrow community,

I'm new to the project and am trying to understand exactly what is happening under the hood when I run a filter-collect query on an Arrow Dataset (backed by Parquet).

Let's say I created a Parquet dataset with no file-level partitions. I just wrote a bunch of separate files to a dataset. Now I want to run a query that returns the rows corresponding to a specific range of datetimes in the dataset's dt column.

My understanding is that the Dataset API will push this query down to the file level, checking the footer of each file for the min/max value of dt and determining whether this block of rows should be read.

Indeed, that understanding is correct.

Assuming this is correct, a few questions:

Will every query result in the reading all of the file footers? Is there any caching of these min/max values?

If you are using the same dataset object to do multiple queries, then the FileMetadata read from the file footers is indeed cached after it is read a first time.

Is there a way to profile query performance? A way to view a query plan before it is executed?

No, at least not yet. I suppose once there is more work on general query execution (and not only reading/filtering), there will come more tools around it. But for now, you will need to do with general performance profiling tools (for python I can recommend py-spy with its mode it also profile native code, and not only python calls).


I appreciate your time in helping me better understand.