arrow-user mailing list archives

From "Uwe L. Korn" <uw...@xhochy.com>
Subject Re: Joining Parquet & PostgreSQL
Date Fri, 16 Nov 2018 15:27:21 GMT
Hello Korry,

the C (GLib) API calls the C++ functions in the background, so it is only another layer on top.
The parquet::arrow C++ API is built in a way that it does not use C++ exceptions. Instead,
if there is a failure, we return arrow::Status objects indicating it.
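
As a rough illustration of that pattern (a sketch only; the function signatures
follow the current parquet::arrow headers and may change in later releases),
opening a file and reading it into a table looks roughly like this, with every
step checked via arrow::Status instead of exceptions:

    #include <memory>
    #include <string>

    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/reader.h>

    // Read a whole Parquet file into an arrow::Table, propagating failures
    // as arrow::Status instead of throwing.
    arrow::Status ReadParquetTable(const std::string& path,
                                   std::shared_ptr<arrow::Table>* out) {
      std::shared_ptr<arrow::io::ReadableFile> infile;
      ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));

      std::unique_ptr<parquet::arrow::FileReader> reader;
      ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
          infile, arrow::default_memory_pool(), &reader));

      return reader->ReadTable(out);
    }

Because nothing throws, a caller such as a PostgreSQL FDW can inspect the
returned Status, copy its message, and raise its own error without C++
unwinding interacting with setjmp/longjmp.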

Uwe

On Fri, Nov 16, 2018, at 3:27 PM, Korry Douglas wrote:
> Thanks Kouhei and Wes for the fast response, much appreciated.
> 
> C++ is a bit troublesome for me because of the difference between 
> PostgreSQL exception handling (setjmp/longjmp) and C++ exception 
> handling (throw/catch) - I’m worried that destructors might not get 
> invoked properly when cleaning up errors in Postgres.  
> 
> I’ve found very few examples on the web that demonstrate how to use the 
> Parquet C or C++ APIs.  Are you aware of any projects that I might look 
> into to understand how to use the APIs?  Any blogs that might be 
> helpful?
> 
> 
> 
>                    — Korry
> 
> 
> > On Nov 16, 2018, at 8:41 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
> > 
> > That will work, but the size of a single row group could be very large
> > 
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L176
> > 
> > This function also appears to have a bug in it. If any column is a
> > ChunkedArray after calling ReadRowGroup, then the call to
> > TableBatchReader::ReadNext will return only part of the row group
> > 
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L200
> > 
> > I opened https://issues.apache.org/jira/browse/ARROW-3822
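> >
> > As a caller-side workaround, one can read a single row group and then drain
> > the TableBatchReader until it returns a null batch, so chunked columns are
> > fully consumed. A sketch (assuming a parquet::arrow::FileReader named
> > `reader`, as in the C++ API discussed above):
> >
> >     for (int i = 0; i < reader->num_row_groups(); ++i) {
> >       std::shared_ptr<arrow::Table> row_group;
> >       ARROW_RETURN_NOT_OK(reader->ReadRowGroup(i, &row_group));
> >
> >       arrow::TableBatchReader batch_reader(*row_group);
> >       std::shared_ptr<arrow::RecordBatch> batch;
> >       while (true) {
> >         // Loop ReadNext: a single call may cover only one chunk.
> >         ARROW_RETURN_NOT_OK(batch_reader.ReadNext(&batch));
> >         if (batch == nullptr) break;  // end of this row group
> >         // ... process `batch` ...
> >       }
> >     }
> >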
> > On Thu, Nov 15, 2018 at 11:23 PM Kouhei Sutou <kou@clear-code.com> wrote:
> >> 
> >> Hi,
> >> 
> >> I think that we can use
> >> parquet::arrow::FileReader::GetRecordBatchReader()
> >> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L175
> >> for this purpose.
> >> 
> >> It doesn't read a caller-specified number of rows, but it reads
> >> only the rows of one row group at a time.
> >> (Do I misunderstand?)
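> >>
> >> A rough sketch of how that could be used (again assuming a
> >> parquet::arrow::FileReader named `reader`; <numeric> is needed for
> >> std::iota, and the signature follows the header linked above):
> >>
> >>     std::vector<int> row_groups(reader->num_row_groups());
> >>     std::iota(row_groups.begin(), row_groups.end(), 0);
> >>
> >>     std::shared_ptr<arrow::RecordBatchReader> batch_reader;
> >>     ARROW_RETURN_NOT_OK(
> >>         reader->GetRecordBatchReader(row_groups, &batch_reader));
> >>
> >>     std::shared_ptr<arrow::RecordBatch> batch;
> >>     while (true) {
> >>       ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
> >>       if (batch == nullptr) break;  // stream exhausted
> >>       // each iteration yields the rows of one row group
> >>     }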
> >> 
> >> 
> >> Thanks,
> >> --
> >> kou
> >> 
> >> In <CAJPUwMBY_KHF84T4KAXPUtVP0AVYiKv05erNA_N=CFJyh8KYEg@mail.gmail.com>
> >>  "Re: Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 22:41:13 -0500,
> >>  Wes McKinney <wesmckinn@gmail.com> wrote:
> >> 
> >>> garrow_record_batch_stream_reader_new() is for reading files that use
> >>> the stream IPC protocol described in
> >>> https://github.com/apache/arrow/blob/master/format/IPC.md, not for
> >>> Parquet files
> >>> 
> >>> We don't have a streaming reader implemented yet for Parquet files.
> >>> The relevant JIRA (a bit thin on detail) is
> >>> https://issues.apache.org/jira/browse/ARROW-1012. To be clear, I mean
> >>> to implement this interface, with the option to read some number of
> >>> "rows" at a time:
> >>> 
> >>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166
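> >>>
> >>> For reference, a sketch of what implementing that interface would look
> >>> like (member signatures roughly as in the linked record_batch.h; they may
> >>> differ in other versions, and the class name here is just a placeholder):
> >>>
> >>>     class ParquetStreamReader : public arrow::RecordBatchReader {
> >>>      public:
> >>>       explicit ParquetStreamReader(std::shared_ptr<arrow::Schema> schema)
> >>>           : schema_(std::move(schema)) {}
> >>>
> >>>       std::shared_ptr<arrow::Schema> schema() const override { return schema_; }
> >>>
> >>>       arrow::Status ReadNext(std::shared_ptr<arrow::RecordBatch>* batch) override {
> >>>         // A real implementation would decode the next N rows of the
> >>>         // Parquet file here; this stub only signals end-of-stream.
> >>>         *batch = nullptr;
> >>>         return arrow::Status::OK();
> >>>       }
> >>>
> >>>      private:
> >>>       std::shared_ptr<arrow::Schema> schema_;
> >>>     };
> >>>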
> >>> On Thu, Nov 15, 2018 at 10:33 PM Kouhei Sutou <kou@clear-code.com> wrote:
> >>>> 
> >>>> Hi,
> >>>> 
> >>>> We haven't implemented the record batch reader feature for Parquet
> >>>> in the C API yet. It's easy to implement, so we can provide the
> >>>> feature in the next release. Can you open a JIRA issue for
> >>>> this feature? You can find the "Create" button at
> >>>> https://issues.apache.org/jira/projects/ARROW/issues/
> >>>> 
> >>>> If you can use the C++ API, you can use the feature with the
> >>>> current release.
> >>>> 
> >>>> 
> >>>> Thanks,
> >>>> --
> >>>> kou
> >>>> 
> >>>> In <1E5D30AE-80FB-41DC-93DD-EA2261C852CB@me.com>
> >>>>  "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
> >>>>  Korry Douglas <korry@me.com> wrote:
> >>>> 
> >>>>> Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW)
> >>>>> that will let PostgreSQL read Parquet-format files.
> >>>>> 
> >>>>> I have just a few questions for now:
> >>>>> 
> >>>>> 1) I have created a few sample Parquet data files using AWS Glue.
> >>>>> Glue split my CSV input into many (48) smaller xxx.snappy.parquet
> >>>>> files, each about 30MB.  When I open one of these files using
> >>>>> gparquet_arrow_file_reader_new_path(), I can then call
> >>>>> gparquet_arrow_file_reader_read_table() (and then access the content
> >>>>> of the table).  However, …_read_table() seems to read the entire file
> >>>>> into memory all at once (I say that based on the amount of time it
> >>>>> takes for gparquet_arrow_file_reader_read_table() to return).  That’s
> >>>>> not the behavior I need.
> >>>>> 
> >>>>> I have tried to use garrow_memory_mapped_input_stream_new() to open
> >>>>> the file, followed by garrow_record_batch_stream_reader_new().  The
> >>>>> call to garrow_record_batch_stream_reader_new() fails with the message:
> >>>>> 
> >>>>> [record-batch-stream-reader][open]: Invalid: Expected to read
> >>>>> 827474256 metadata bytes, but only read 30284162
> >>>>> 
> >>>>> Does this error occur because Glue split the input data?  Or because
> >>>>> Glue compressed the data using snappy?  Do I need to uncompress before
> >>>>> I can read/open the file?  Do I need to merge the files before I can
> >>>>> open/read the data?
> >>>>> 
> >>>>> 2) If I use garrow_record_batch_stream_reader_new() instead of
> >>>>> gparquet_arrow_file_reader_new_path(), will I avoid the overhead of
> >>>>> reading the entire file into memory before I fetch the first row?
> >>>>> 
> >>>>> 
> >>>>> Thanks in advance for any help and advice.
> >>>>> 
> >>>>> 
> >>>>>            ― Korry
> 
