arrow-user mailing list archives

From Andrew Lamb <al...@influxdata.com>
Subject Re: [Rust] [DataFusion] Reading remote parquet files in S3?
Date Sun, 14 Feb 2021 10:13:57 GMT
The Buzz project is one example I know of that reads Parquet files from S3
using the Rust implementation:

https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs

The SerializedFileReader [1] from the Rust parquet crate, despite its
somewhat misleading name, doesn't have to read from files. Instead, it reads
from anything that implements the ChunkReader [2] trait. I am not sure how
well this matches what you are looking for.
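
To make the idea concrete, here is a minimal sketch using only the standard
library. It does not use the parquet crate: RangeSource is a hypothetical,
simplified stand-in for the ChunkReader idea (a source that knows its length
and can return a reader over a byte range), and InMemoryObject is a
hypothetical placeholder for a remote object. A real S3-backed version would
presumably issue a ranged GET inside get_read instead of slicing a buffer.

```rust
use std::io::{self, Cursor, Read};

/// Hypothetical, simplified analogue of a chunk-reading trait: a source that
/// knows its total length and can hand back a reader over a byte range.
/// This is NOT the parquet crate's actual trait definition.
trait RangeSource {
    type Reader: Read;

    /// Total size of the underlying object in bytes.
    fn len(&self) -> u64;

    /// Returns a reader over `length` bytes starting at offset `start`.
    fn get_read(&self, start: u64, length: usize) -> io::Result<Self::Reader>;
}

/// Hypothetical in-memory object. A remote implementation would fetch the
/// requested range over the network instead of copying from a Vec.
struct InMemoryObject {
    bytes: Vec<u8>,
}

impl RangeSource for InMemoryObject {
    type Reader = Cursor<Vec<u8>>;

    fn len(&self) -> u64 {
        self.bytes.len() as u64
    }

    fn get_read(&self, start: u64, length: usize) -> io::Result<Self::Reader> {
        let total = self.bytes.len() as u64;
        // Reject ranges that overflow or run past the end of the object.
        let end = start
            .checked_add(length as u64)
            .filter(|&e| e <= total)
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds"))?;
        Ok(Cursor::new(self.bytes[start as usize..end as usize].to_vec()))
    }
}

/// Checks whether the source ends with the 4-byte Parquet magic "PAR1".
/// Only the last four bytes are requested, so a remote source would not
/// need to download the whole object.
fn has_parquet_magic<S: RangeSource>(source: &S) -> io::Result<bool> {
    const MAGIC: &[u8; 4] = b"PAR1";
    if source.len() < MAGIC.len() as u64 {
        return Ok(false);
    }
    let mut buf = [0u8; 4];
    source
        .get_read(source.len() - 4, 4)?
        .read_exact(&mut buf)?;
    Ok(&buf == MAGIC)
}
```

Code written against the trait, such as has_parquet_magic, never needs to
know whether the bytes come from local disk, memory, or an object store,
which is the property that lets a reader work on non-file sources.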

Hope that helps,
Andrew

[1]
https://docs.rs/parquet/3.0.0/parquet/file/serialized_reader/struct.SerializedFileReader.html
[2] https://docs.rs/parquet/3.0.0/parquet/file/reader/trait.ChunkReader.html



On Sat, Feb 13, 2021 at 10:17 AM Steve Kim <chairmank@gmail.com> wrote:

> > Currently, parquet.rs only supports local disk files. Potentially, this
> > can be done using the rusoto crate, which provides an S3 client. What would
> > be a good way to do this?
> > 1. create a remote parquet reader (potentially duplicate lots of code)
> > 2. create an interface to abstract away reading from local/remote files
> > (not sure about performance if the reader blocks on every operation)
>
> This is a great question.
>
> I think that approach (2) is superior, although it requires more work
> than approach (1) to design an interface that works well across
> multiple file stores that have different performance characteristics.
> To accommodate storage-specific performance optimizations, I expect
> that the common interface will have to be more elaborate than the
> current reader API.
>
> Is it possible for the Rust reader to use the C++ implementation
> (https://github.com/apache/arrow/tree/master/cpp/src/arrow/filesystem)?
> If this reuse of implementation is feasible, then we could focus
> efforts on improving the C++ implementation and get the benefits in
> Python, Rust, etc.
>
> In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses
> the Hadoop FileSystem abstraction. This abstraction is complex, leaky,
> and not well specialized for read patterns that are typical for
> Parquet files. We can learn from these mistakes to create a superior
> reader interface in the Arrow/Parquet project.
>
> Steve
>
