arrow-user mailing list archives

From Andrew Lamb <>
Subject Re: [Rust] [DataFusion] Reading remote parquet files in S3?
Date Sun, 14 Feb 2021 10:13:57 GMT
The Buzz project is one example I know of that reads Parquet files from S3
using the Rust implementation.

The SerializedFileReader [1] from the Rust parquet crate, despite its
somewhat misleading name, doesn't have to read from files. Instead, it reads
from anything that implements the ChunkReader [2] trait. I am not sure how
well this matches what you are looking for.
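
To make the idea concrete, here is a minimal, self-contained sketch of that
kind of abstraction. It does not use the parquet crate: the names RangeSource,
LocalFile, InMemoryObject and footer_len are hypothetical and only mirror the
general shape (a total length plus byte-range reads), so check the crate
documentation for the real ChunkReader signature. InMemoryObject stands in
for a remote object; an S3-backed version would presumably issue a ranged GET
where this one slices a buffer.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical abstraction (NOT the parquet crate's API): anything that
/// knows its total size and can return an arbitrary byte range.
pub trait RangeSource {
    fn len(&self) -> u64;
    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>>;
}

/// Local file backend: seek to `start`, then read exactly `length` bytes.
pub struct LocalFile {
    file: File,
    len: u64,
}

impl LocalFile {
    pub fn open(path: &std::path::Path) -> io::Result<Self> {
        let file = File::open(path)?;
        let len = file.metadata()?.len();
        Ok(Self { file, len })
    }
}

impl RangeSource for LocalFile {
    fn len(&self) -> u64 {
        self.len
    }

    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>> {
        let mut f = &self.file; // `&File` implements both Read and Seek
        f.seek(SeekFrom::Start(start))?;
        let mut buf = vec![0u8; length];
        f.read_exact(&mut buf)?;
        Ok(buf)
    }
}

/// Stand-in for a remote object (assumption: a real S3 backend would send
/// a ranged GET request here instead of slicing an in-memory buffer).
pub struct InMemoryObject(pub Vec<u8>);

impl RangeSource for InMemoryObject {
    fn len(&self) -> u64 {
        self.0.len() as u64
    }

    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>> {
        let start = start as usize;
        let end = start
            .checked_add(length)
            .filter(|&e| e <= self.0.len())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds"))?;
        Ok(self.0[start..end].to_vec())
    }
}

/// A Parquet file ends with a 4-byte little-endian footer length followed by
/// the magic bytes "PAR1", so one 8-byte range read locates the footer.
pub fn footer_len<S: RangeSource>(src: &S) -> io::Result<u32> {
    if src.len() < 8 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file too short"));
    }
    let tail = src.read_range(src.len() - 8, 8)?;
    if !tail.ends_with(b"PAR1") {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing PAR1 magic"));
    }
    Ok(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}
```

Because footer_len only depends on the trait, the same code path works for
a local file and for a remote object, and the remote case costs a single
small range request rather than a full download.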

Hope that helps,


On Sat, Feb 13, 2021 at 10:17 AM Steve Kim <> wrote:

> > Currently, only supports local disk files. Potentially, this
> > can be done using the rusoto crate that provides an S3 client. What would be
> > a good way to do this?
> > 1. create a remote parquet reader (potentially duplicate lots of code)
> > 2. create an interface to abstract away reading from local/remote files
> > (not sure about performance if the reader blocks on every operation)
> This is a great question.
>
> I think that approach (2) is superior, although it requires more work
> than approach (1) to design an interface that works well across
> multiple file stores that have different performance characteristics.
> To accommodate storage-specific performance optimizations, I expect
> that the common interface will have to be more elaborate than the
> current reader API.
>
> Is it possible for the Rust reader to use the C++ implementation
> (
> If this reuse of implementation is feasible, then we could focus
> efforts on improving the C++ implementation and get the benefits in
> Python, Rust, etc.
>
> In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses
> the Hadoop FileSystem abstraction. This abstraction is complex, leaky,
> and not well specialized for read patterns that are typical for
> Parquet files. We can learn from these mistakes to create a superior
> reader interface in the Arrow/Parquet project.
>
> Steve
