arrow-user mailing list archives

From: Kouhei Sutou <...@clear-code.com>
Subject: Re: Joining Parquet & PostgreSQL
Date: Fri, 16 Nov 2018 03:59:07 GMT
Hi,

I think that we can use
parquet::arrow::FileReader::GetRecordBatchReader()
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L175
for this purpose.

It doesn't read a specified number of rows at a time, but it
will read only the rows in each row group.
(Do I misunderstand?)
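
For example, a rough sketch with the C++ API (the Status-returning
signatures here follow the reader.h of this era, and the helper name
ReadByRowGroup is just for illustration, so please double-check against
the headers of your release):

  #include <arrow/api.h>
  #include <arrow/io/file.h>
  #include <parquet/arrow/reader.h>

  #include <iostream>
  #include <memory>
  #include <vector>

  // Sketch: read a Parquet file one row group at a time instead of
  // materializing the whole table with ReadTable().
  arrow::Status ReadByRowGroup(const std::string& path) {
    std::shared_ptr<arrow::io::ReadableFile> input;
    ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &input));

    std::unique_ptr<parquet::arrow::FileReader> reader;
    ARROW_RETURN_NOT_OK(
        parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));

    // Ask for every row group; the returned reader streams them one by one.
    std::vector<int> row_groups;
    for (int i = 0; i < reader->num_row_groups(); ++i) {
      row_groups.push_back(i);
    }

    std::shared_ptr<arrow::RecordBatchReader> batch_reader;
    ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

    std::shared_ptr<arrow::RecordBatch> batch;
    while (true) {
      ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
      if (batch == nullptr) break;  // end of stream
      std::cout << "batch: " << batch->num_rows() << " rows" << std::endl;
    }
    return arrow::Status::OK();
  }

Memory use should then be bounded by the size of one row group rather
than by the whole file.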


Thanks,
--
kou

In <CAJPUwMBY_KHF84T4KAXPUtVP0AVYiKv05erNA_N=CFJyh8KYEg@mail.gmail.com>
  "Re: Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 22:41:13 -0500,
  Wes McKinney <wesmckinn@gmail.com> wrote:

> garrow_record_batch_stream_reader_new() is for reading files that use
> the stream IPC protocol described in
> https://github.com/apache/arrow/blob/master/format/IPC.md, not for
> Parquet files
> 
> We don't have a streaming reader implemented yet for Parquet files.
> The relevant JIRA (a bit thin on detail) is
> https://issues.apache.org/jira/browse/ARROW-1012. To be clear, I mean
> to implement this interface, with the option to read some number of
> "rows" at a time:
> 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166
> On Thu, Nov 15, 2018 at 10:33 PM Kouhei Sutou <kou@clear-code.com> wrote:
>>
>> Hi,
>>
>> We haven't implemented the record batch reader feature for Parquet
>> in the C API yet. It's easy to implement, so we can provide the
>> feature in the next release. Can you open a JIRA issue for
>> this feature? You can find the "Create" button at
>> https://issues.apache.org/jira/projects/ARROW/issues/
>>
>> If you can use the C++ API, you can use the feature with the
>> current release.
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In <1E5D30AE-80FB-41DC-93DD-EA2261C852CB@me.com>
>>   "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
>>   Korry Douglas <korry@me.com> wrote:
>>
>> > Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW)
>> > that will let PostgreSQL read Parquet-format files.
>> >
>> > I have just a few questions for now:
>> >
>> > 1) I have created a few sample Parquet data files using AWS Glue.  Glue
>> > split my CSV input into many (48) smaller xxx.snappy.parquet files, each
>> > about 30MB. When I open one of these files using
>> > gparquet_arrow_file_reader_new_path(), I can then call
>> > gparquet_arrow_file_reader_read_table() (and then access the content of
>> > the table).  However, …_read_table() seems to read the entire file into
>> > memory all at once (I say that based on the amount of time it takes for
>> > gparquet_arrow_file_reader_read_table() to return).  That’s not the
>> > behavior I need.
>> >
>> > I have tried to use garrow_memory_mapped_input_stream_new() to open the
>> > file, followed by garrow_record_batch_stream_reader_new().  The call to
>> > garrow_record_batch_stream_reader_new() fails with the message:
>> >
>> > [record-batch-stream-reader][open]: Invalid: Expected to read 827474256
>> > metadata bytes, but only read 30284162
>> >
>> > Does this error occur because Glue split the input data?  Or because Glue
>> > compressed the data using snappy?  Do I need to uncompress before I can
>> > read/open the file?  Do I need to merge the files before I can open/read
>> > the data?
>> >
>> > 2) If I use garrow_record_batch_stream_reader_new() instead of
>> > gparquet_arrow_file_reader_new_path(), will I avoid the overhead of
>> > reading the entire file into memory before I fetch the first row?
>> >
>> >
>> > Thanks in advance for help and any advice.
>> >
>> >
>> >             ― Korry
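
For reference, the ::arrow::RecordBatchReader interface Wes mentions above
boils down to schema() plus ReadNext(): anything that produces record
batches (the row-group-based reader above, a future streaming Parquet
reader per ARROW-1012, an IPC stream reader, ...) can be consumed through
it in the same way. A small sketch (the function name DumpBatches is just
illustrative):

  #include <iostream>
  #include <memory>

  #include <arrow/api.h>

  // Consumes any arrow::RecordBatchReader through its two-method
  // interface: schema() gives the shared schema, ReadNext() yields
  // batches until it sets the output to nullptr at end of stream.
  arrow::Status DumpBatches(arrow::RecordBatchReader* reader) {
    std::cout << reader->schema()->ToString() << std::endl;
    std::shared_ptr<arrow::RecordBatch> batch;
    while (true) {
      ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
      if (batch == nullptr) break;  // end of stream
      std::cout << batch->num_rows() << " rows" << std::endl;
    }
    return arrow::Status::OK();
  }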