arrow-user mailing list archives

From Andrew Lamb <al...@influxdata.com>
Subject Re: [Rust] [DataFusion] Reading remote parquet files in S3?
Date Mon, 15 Feb 2021 11:09:44 GMT
I don't know of any examples in the DataFusion codebase that take a
ChunkReader directly.

The cloudfuse-io code implements the ChunkReader trait for its `CachedFile`
type here:
https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/clients/cached_file.rs#L10
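For reference, here is a minimal self-contained sketch of what implementing that trait involves. The `Length` and `ChunkReader` definitions below are paraphrased from the parquet 3.0.0 docs so the example compiles on its own (check the crate docs for the exact signatures), and `BufferedObject` is a hypothetical stand-in for a file already fetched from a remote store:

```rust
use std::io::{Cursor, Read};

// Paraphrased shape of the parquet crate's traits (see the ChunkReader
// docs for the real definitions); reproduced here so the sketch is
// self-contained.
pub trait Length {
    fn len(&self) -> u64;
}

pub trait ChunkReader: Length {
    type T: Read;
    fn get_read(&self, start: u64, length: usize) -> std::io::Result<Self::T>;
}

// Hypothetical source: a fully buffered object, e.g. a file already
// downloaded from S3. A real remote reader would issue a ranged GET
// here instead of slicing a Vec.
pub struct BufferedObject {
    data: Vec<u8>,
}

impl Length for BufferedObject {
    fn len(&self) -> u64 {
        self.data.len() as u64
    }
}

impl ChunkReader for BufferedObject {
    type T = Cursor<Vec<u8>>;

    // Serve an independent reader over the requested byte range; the
    // file reader calls this once per chunk it needs.
    fn get_read(&self, start: u64, length: usize) -> std::io::Result<Self::T> {
        let start = start as usize;
        let end = (start + length).min(self.data.len());
        Ok(Cursor::new(self.data[start..end].to_vec()))
    }
}

fn main() {
    let obj = BufferedObject { data: b"PAR1....PAR1".to_vec() };
    let mut r = obj.get_read(0, 4).unwrap();
    let mut magic = [0u8; 4];
    r.read_exact(&mut magic).unwrap();
    assert_eq!(&magic, b"PAR1");
}
```

The point is that each `get_read` call is independent, so a remote implementation can translate it into a ranged request rather than a sequential scan.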

On Mon, Feb 15, 2021 at 4:19 AM Jack Chan <j4ck.cyw@gmail.com> wrote:

> Thanks Andrew.
>
> As you mentioned, the ChunkReader trait is flexible enough. What is missing
> is a way to provide a Parquet reader implementation backed by a customized
> ChunkReader. Are there any examples within DataFusion where people can
> change the execution plan like this?
>
> If I understand correctly, the steps cloudfuse-io took are: 1. define an S3
> Parquet table provider [1], and 2. define an S3 Parquet reader [2]. This
> confirms my understanding that creating your own remote Parquet reader
> requires lots of duplication.
>
> [1]
> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/datasource/hbee/s3_parquet.rs
> [2]
> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs
>
> Jack
>
> Andrew Lamb <alamb@influxdata.com> wrote on Sunday, February 14, 2021 at 2:14 AM:
>
>> The Buzz project is one example I know of that reads parquet files from
>> S3 using the Rust implementation
>>
>>
>> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs
>>
>> The SerializedFileReader [1] from the Rust parquet crate, despite its
>> somewhat misleading name, doesn't have to read from files; instead, it
>> reads from anything that implements the ChunkReader [2] trait. I am not
>> sure how well this matches what you are looking for.
>>
>> Hope that helps,
>> Andrew
>>
>> [1]
>> https://docs.rs/parquet/3.0.0/parquet/file/serialized_reader/struct.SerializedFileReader.html
>> [2]
>> https://docs.rs/parquet/3.0.0/parquet/file/reader/trait.ChunkReader.html
>>
>>
>>
>> On Sat, Feb 13, 2021 at 10:17 AM Steve Kim <chairmank@gmail.com> wrote:
>>
>>> > Currently, parquet.rs only supports local disk files. Potentially,
>>> this can be done using the rusoto crate that provides an S3 client. What
>>> would be a good way to do this?
>>> > 1. create a remote parquet reader (potentially duplicate lots of code)
>>> > 2. create an interface to abstract away reading from local/remote
>>> files (not sure about performance if the reader blocks on every operation)
>>>
>>> This is a great question.
>>>
>>> I think that approach (2) is superior, although it requires more work
>>> than approach (1) to design an interface that works well across
>>> multiple file stores that have different performance characteristics.
>>> To accommodate storage-specific performance optimizations, I expect
>>> that the common interface will have to be more elaborate than the
>>> current reader API.
>>>
>>> Is it possible for the Rust reader to use the C++ implementation
>>> (https://github.com/apache/arrow/tree/master/cpp/src/arrow/filesystem)?
>>> If this reuse of implementation is feasible, then we could focus
>>> efforts on improving the C++ implementation and get the benefits in
>>> Python, Rust, etc.
>>>
>>> In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses
>>> the Hadoop FileSystem abstraction. This abstraction is complex, leaky,
>>> and not well specialized for read patterns that are typical for
>>> Parquet files. We can learn from these mistakes to create a superior
>>> reader interface in the Arrow/Parquet project.
>>>
>>> Steve
>>>
>>
