I don't know of any examples in the DataFusion codebase that take a ChunkReader directly.

The cloudfuse-io code implements the ChunkReader trait for its `CachedFile` here: https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/clients/cached_file.rs#L10
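
For illustration, here is a rough sketch of what implementing that trait for a custom source can look like, assuming the ChunkReader / Length definitions in the parquet crate at the time of writing (the exact signatures have moved between releases). The in-memory buffer is just a stand-in; a real S3 reader would issue ranged GET requests inside get_read:

use std::io::Cursor;

use parquet::errors::Result;
use parquet::file::reader::{ChunkReader, Length};

/// Stand-in for a remote object; a real implementation would hold an S3
/// client, bucket, and key instead of a local buffer.
struct InMemoryFile {
    data: Vec<u8>,
}

impl Length for InMemoryFile {
    /// Total size of the source, used to locate the footer metadata.
    fn len(&self) -> u64 {
        self.data.len() as u64
    }
}

impl ChunkReader for InMemoryFile {
    type T = Cursor<Vec<u8>>;

    /// Return a reader over the requested byte range. For S3 this is where
    /// a `Range: bytes=start-(start+length-1)` request would be issued.
    fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
        let start = start as usize;
        Ok(Cursor::new(self.data[start..start + length].to_vec()))
    }
}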

On Mon, Feb 15, 2021 at 4:19 AM Jack Chan <j4ck.cyw@gmail.com> wrote:
Thanks Andrew.

As you mentioned, ChunkReader is flexible enough. So what is missing is a way to provide a parquet reader implementation backed by a customized ChunkReader. Are there any examples within DataFusion where people change the execution plan like this?

If I understand correctly, the steps cloudfuse-io took were: 1. define an S3 parquet table provider [1]; 2. define an S3 parquet reader [2]. This confirms my understanding that creating your own remote parquet reader requires a lot of duplication.

[1] https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/datasource/hbee/s3_parquet.rs
[2] https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs

Jack

Andrew Lamb <alamb@influxdata.com> wrote on Sunday, February 14, 2021 at 2:14 AM:
The Buzz project is one example I know of that reads parquet files from S3 using the Rust implementation.


The SerializedFileReader [1] from the Rust parquet crate, despite its somewhat misleading name, doesn't have to read from files; instead, it reads from anything that implements the ChunkReader [2] trait. I am not sure how well this matches what you are looking for.
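
For example, a reader can be built from any ChunkReader implementation (a local file, an in-memory buffer, or an S3-backed type). An untested sketch:

use parquet::file::reader::{ChunkReader, FileReader, SerializedFileReader};

// `source` can be anything that implements ChunkReader; it does not
// have to be a local file on disk.
fn print_row_group_count<R: ChunkReader + 'static>(
    source: R,
) -> parquet::errors::Result<()> {
    let reader = SerializedFileReader::new(source)?;
    println!("row groups: {}", reader.metadata().num_row_groups());
    Ok(())
}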

Hope that helps,
Andrew




On Sat, Feb 13, 2021 at 10:17 AM Steve Kim <chairmank@gmail.com> wrote:
> Currently, parquet.rs only supports local disk files. Potentially, this can be done using the rusoto crate that provides an S3 client. What would be a good way to do this?
> 1. create a remote parquet reader (potentially duplicate lots of code)
> 2. create an interface to abstract away reading from local/remote files (not sure about performance if the reader blocks on every operation)

This is a great question.

I think that approach (2) is superior, although it requires more work
than approach (1) to design an interface that works well across
multiple file stores that have different performance characteristics.
To accommodate storage-specific performance optimizations, I expect
that the common interface will have to be more elaborate than the
current reader API.
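
To make that concrete, the kind of common interface I have in mind might look vaguely like the sketch below (the names are made up; nothing like this exists in the crates today):

use std::io::Read;

/// Purely illustrative: a storage abstraction whose operations map onto
/// both local files and object stores.
trait ObjectStore: Send + Sync {
    /// Total size of the object, needed to locate the parquet footer.
    fn size(&self, path: &str) -> std::io::Result<u64>;

    /// Read a byte range, which maps naturally onto an HTTP Range request
    /// for remote stores and onto seek + read for local files.
    fn read_range(
        &self,
        path: &str,
        start: u64,
        length: usize,
    ) -> std::io::Result<Box<dyn Read + Send>>;
}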

Is it possible for the Rust reader to use the C++ filesystem
implementation
(https://github.com/apache/arrow/tree/master/cpp/src/arrow/filesystem)?
If that reuse is feasible, then we could focus our efforts on
improving the C++ implementation and get the benefits in Python,
Rust, etc.

In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses
the Hadoop FileSystem abstraction. This abstraction is complex, leaky,
and not well specialized for read patterns that are typical for
Parquet files. We can learn from these mistakes to create a superior
reader interface in the Arrow/Parquet project.

Steve