The arrow::RecordBatchReader needs an arrow::dataset::RecordBatchProjector, which in turn needs the Schema. It seems that I can't get the schema first and then read the Parquet stream with Arrow.
In my situation, the Parquet file lives in an object store like S3. I can fetch it over the network slice by slice, with slices of any size, but I can't hold the whole file in memory or on disk.
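
To make the constraint concrete, here is a minimal sketch of what I have so far: a custom arrow::io::RandomAccessFile that turns every read into a network range request. FetchRange and RemoteObjectSize are hypothetical stand-ins for my S3 client calls, and the Result-based signatures follow recent Arrow releases (older ones use Status with out-parameters):

#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

#include <arrow/buffer.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Hypothetical helpers standing in for my S3 range requests.
std::vector<uint8_t> FetchRange(int64_t offset, int64_t nbytes);
int64_t RemoteObjectSize();

// A file abstraction backed by range requests instead of a local file.
class RemoteRangeFile : public arrow::io::RandomAccessFile {
 public:
  arrow::Status Close() override {
    closed_ = true;
    return arrow::Status::OK();
  }
  bool closed() const override { return closed_; }
  arrow::Result<int64_t> Tell() const override { return pos_; }
  arrow::Status Seek(int64_t position) override {
    pos_ = position;
    return arrow::Status::OK();
  }
  arrow::Result<int64_t> GetSize() override { return RemoteObjectSize(); }

  // Fetch exactly the requested slice over the network.
  arrow::Result<int64_t> Read(int64_t nbytes, void* out) override {
    std::vector<uint8_t> bytes = FetchRange(pos_, nbytes);
    std::memcpy(out, bytes.data(), bytes.size());
    pos_ += static_cast<int64_t>(bytes.size());
    return static_cast<int64_t>(bytes.size());
  }

  arrow::Result<std::shared_ptr<arrow::Buffer>> Read(int64_t nbytes) override {
    ARROW_ASSIGN_OR_RAISE(auto buffer, arrow::AllocateResizableBuffer(nbytes));
    ARROW_ASSIGN_OR_RAISE(int64_t n, Read(nbytes, buffer->mutable_data()));
    ARROW_RETURN_NOT_OK(buffer->Resize(n));
    return std::shared_ptr<arrow::Buffer>(std::move(buffer));
  }

 private:
  int64_t pos_ = 0;
  bool closed_ = false;
};

Overriding ReadAt() to issue the range request directly, instead of relying on the default Seek-then-Read, would presumably avoid some synchronization, but I assume the defaults are enough to start with.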
Your reply indicates that C++ can't read streaming Parquet yet, so what should I try next, with Arrow or with something else? Below is a rough sketch of how I imagined using GetRecordBatchReader on top of the wrapper above; does it look like the right direction?
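
This is only my guess at the intended usage, so the exact calls may be off; the idea is that OpenFile should pull just the footer (schema and row-group metadata) through the wrapper, which would give me the schema before any data pages are fetched:

#include <memory>
#include <numeric>
#include <vector>

#include <arrow/io/interfaces.h>
#include <arrow/memory_pool.h>
#include <arrow/record_batch.h>
#include <arrow/status.h>
#include <arrow/type.h>
#include <parquet/arrow/reader.h>

arrow::Status StreamBatches(std::shared_ptr<arrow::io::RandomAccessFile> file) {
  // Opening the reader only needs the footer, not the whole file.
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(file, arrow::default_memory_pool(), &reader));

  // The schema is available before reading any data pages.
  std::shared_ptr<arrow::Schema> schema;
  ARROW_RETURN_NOT_OK(reader->GetSchema(&schema));

  // Stream every row group; batches are materialized one at a time.
  std::vector<int> row_groups(reader->num_row_groups());
  std::iota(row_groups.begin(), row_groups.end(), 0);
  std::shared_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... process one batch of rows here ...
  }
  return arrow::Status::OK();
}

If each row group is sized to fit in memory, this would keep the peak footprint to roughly one row group at a time, which is what I need.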
Thank you for your work~~
At 2019-11-01 01:46:32, "Wes McKinney" wrote:
>You will want to use the GetRecordBatchReader C++ API here
>
>https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L152
>
>It may not be optimal for your use case. Support for streaming reads
>is not yet exposed in Python or other bindings as far as I know.
>
>There is work happening in the C++ Datasets project to better support
>this use case.
>
>On Wed, Oct 30, 2019 at 9:28 PM annsshadow wrote:
>>
>>
>> hi~
>> I have a question about reading Parquet files.
>> The official example reads the whole file from local disk.
>> In my case I can't keep the whole Parquet file in memory and can only fetch it slice by slice from the network, so how can I use Arrow to read the Parquet file?
>> thank you~