arrow-user mailing list archives

From: Steve Kim <chairm...@gmail.com>
Subject: Re: [Rust] [DataFusion] Reading remote parquet files in S3?
Date: Sat, 13 Feb 2021 15:16:57 GMT
> Currently, parquet.rs only supports local disk files. Potentially, this can be done using
> the rusoto crate, which provides an S3 client. What would be a good way to do this?
> 1. create a remote parquet reader (potentially duplicating lots of code)
> 2. create an interface to abstract away reading from local/remote files (not sure about
> performance if the reader blocks on every operation)

This is a great question.

I think that approach (2) is superior, although it requires more work
than approach (1) to design an interface that works well across
multiple file stores that have different performance characteristics.
To accommodate storage-specific performance optimizations, I expect
that the common interface will have to be more elaborate than the
current reader API.
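
To make (2) concrete, here is a rough sketch of one shape such an
interface could take, using the rusoto crate for the S3 side as the
original message suggests. The names (RangeReader, LocalRangeReader,
S3RangeReader) are hypothetical, not an existing parquet.rs API, and
the async design is only one possible way to avoid blocking on every
remote operation:

// A minimal sketch of approach (2). All names here are hypothetical;
// nothing below is an existing parquet.rs API.
use async_trait::async_trait; // async fns in traits need this crate today
use futures::TryStreamExt;
use rusoto_s3::{GetObjectRequest, S3Client, S3};
use std::io::{Read, Seek, SeekFrom};

#[async_trait]
pub trait RangeReader {
    /// Read `length` bytes starting at absolute byte offset `start`.
    async fn read_range(&mut self, start: u64, length: usize) -> std::io::Result<Vec<u8>>;
}

/// Local files: an ordinary seek + read.
pub struct LocalRangeReader {
    file: std::fs::File,
}

#[async_trait]
impl RangeReader for LocalRangeReader {
    async fn read_range(&mut self, start: u64, length: usize) -> std::io::Result<Vec<u8>> {
        self.file.seek(SeekFrom::Start(start))?;
        let mut buf = vec![0u8; length];
        self.file.read_exact(&mut buf)?;
        Ok(buf)
    }
}

/// S3: translate each read into an HTTP range request, so a column chunk
/// can be fetched without downloading the whole object.
pub struct S3RangeReader {
    client: S3Client,
    bucket: String,
    key: String,
}

#[async_trait]
impl RangeReader for S3RangeReader {
    async fn read_range(&mut self, start: u64, length: usize) -> std::io::Result<Vec<u8>> {
        let req = GetObjectRequest {
            bucket: self.bucket.clone(),
            key: self.key.clone(),
            range: Some(format!("bytes={}-{}", start, start + length as u64 - 1)),
            ..Default::default()
        };
        let out = self
            .client
            .get_object(req)
            .await
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;
        let body = out.body.ok_or_else(|| {
            std::io::Error::new(std::io::ErrorKind::UnexpectedEof, "empty response body")
        })?;
        // Collect the streaming body into a single buffer.
        body.map_ok(|chunk| chunk.to_vec()).try_concat().await
    }
}

A real interface would probably also need vectored or coalesced range
reads and access-pattern hints, so that each backend can batch and
prefetch requests in whatever way suits its latency profile.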

Is it possible for the Rust reader to reuse the C++ filesystem
implementation
(https://github.com/apache/arrow/tree/master/cpp/src/arrow/filesystem)?
If that reuse is feasible, then we could focus our efforts on improving
the C++ implementation and get the benefits in Python, Rust, etc.

In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses
the Hadoop FileSystem abstraction. This abstraction is complex, leaky,
and not well specialized for read patterns that are typical for
Parquet files. We can learn from these mistakes to create a superior
reader interface in the Arrow/Parquet project.
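
For example, a Parquet reader's first requests go to the tail of the
file, and only then to the byte ranges of the column chunks it needs.
Sketched against the hypothetical RangeReader above:

// The last 8 bytes of a Parquet file are the 4-byte little-endian
// metadata length followed by the 4-byte "PAR1" magic. Two small ranged
// reads recover the whole footer without touching the rest of the file.
async fn read_footer<R: RangeReader>(reader: &mut R, file_len: u64) -> std::io::Result<Vec<u8>> {
    let tail = reader.read_range(file_len - 8, 8).await?;
    let mut len_bytes = [0u8; 4];
    len_bytes.copy_from_slice(&tail[0..4]);
    let meta_len = u32::from_le_bytes(len_bytes) as u64;
    // One more ranged read fetches the footer metadata itself; a remote
    // store serves this without transferring the data pages.
    reader.read_range(file_len - 8 - meta_len, meta_len as usize).await
}

A backend that knows reads arrive as a small footer fetch followed by a
handful of large column-chunk ranges can coalesce or prefetch them;
exposing that knowledge is exactly the kind of storage-specific
optimization the common interface should allow for.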

Steve
