arrow-user mailing list archives

From Micah Kornfield <>
Subject Re: Re: [C++] How can I read streaming parquet file in v0.15.0
Date Thu, 07 Nov 2019 06:47:03 GMT
I'm not sure what is meant by "streaming" in this context. My
understanding is that Parquet file reading needs random access. In this
regard, if you are trying to fetch from S3, you can open a RandomAccessFile
object using the S3FileSystem and then create a Parquet file reader with
that object. I'm not sure if this code path has been well tested.
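
To illustrate why random access is needed rather than a forward-only stream: an unencrypted Parquet file keeps its metadata (schema and row group offsets) at the end of the file, followed by a 4-byte little-endian metadata length and the magic bytes "PAR1". A reader therefore starts from the tail and then jumps to the byte ranges it needs. The sketch below is not Arrow code; `RangeReadFn`, `FooterLocation`, and `LocateFooter` are hypothetical names made up for this example, and the callback stands in for something like an S3 ranged GET.

```cpp
#include <cstdint>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical callback: returns `length` bytes starting at `offset`.
// Against an object store this could be an HTTP GET with a Range header.
using RangeReadFn = std::function<std::string(uint64_t offset, uint64_t length)>;

// Hypothetical result type: where the file metadata lives inside the file.
struct FooterLocation {
  uint64_t offset;  // byte offset where the serialized file metadata starts
  uint32_t length;  // size of the serialized file metadata in bytes
};

// Layout of the end of an unencrypted Parquet file:
//   <file metadata> <4-byte little-endian metadata length> "PAR1"
// Reading only the last 8 bytes is enough to find the metadata.
FooterLocation LocateFooter(const RangeReadFn& read_at, uint64_t file_size) {
  // Leading magic (4) + metadata length (4) + trailing magic (4).
  if (file_size < 12) {
    throw std::runtime_error("too small to be a Parquet file");
  }
  const std::string tail = read_at(file_size - 8, 8);
  if (tail.size() != 8 || std::memcmp(tail.data() + 4, "PAR1", 4) != 0) {
    throw std::runtime_error("missing PAR1 magic at end of file");
  }
  // Decode the little-endian length byte by byte so the result does not
  // depend on the host's endianness.
  uint32_t len = 0;
  for (int i = 0; i < 4; ++i) {
    len |= static_cast<uint32_t>(static_cast<unsigned char>(tail[i])) << (8 * i);
  }
  if (static_cast<uint64_t>(len) + 12 > file_size) {
    throw std::runtime_error("metadata length exceeds file size");
  }
  return FooterLocation{file_size - 8 - len, len};
}
```

With the metadata range in hand, a reader can fetch just that range, learn the schema and the offsets of each row group, and then request individual row groups one at a time. That keeps memory bounded without requiring the whole file locally, which is roughly what a RandomAccessFile-backed Parquet reader does on your behalf.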

On Fri, Nov 1, 2019 at 12:56 AM annsshadow <> wrote:

> The arrow::RecordBatchReader needs an arrow::dataset::RecordBatchProjector,
> which needs the Schema. It seems that I can't get the schema first and then
> read the streaming parquet with arrow.
>
> In my situation, the parquet file is in an object store like S3. I can
> fetch it from the network slice by slice, whatever the file size, but I
> can't hold the whole file in memory or on disk.
>
> Your reply indicates that the C++ library can't read streaming parquet
> yet, so what should I try next, with arrow or anything else?
>
> Thank you for your work~~
> At 2019-11-01 01:46:32, "Wes McKinney" <> wrote:
> >You will want to use the GetRecordBatchReader C++ API here
> >
> >
> >
> >It may not be optimal for your use case. Support for streaming reads
> >is not yet exposed in Python or other bindings as far as I know.
> >
> >There is work happening in the C++ Datasets project to better support
> >this use case.
> >
> >On Wed, Oct 30, 2019 at 9:28 PM annsshadow <> wrote:
> >>
> >>
> >> hi~
> >> I have a question about reading parquet files.
> >> The official example reads the whole file from local disk.
> >> I can't hold the whole parquet file in memory; I can only fetch it
> >> slice by slice from the network. So how can I use arrow to read the
> >> parquet file?
> >> thank you~
