arrow-user mailing list archives

From Korry Douglas
Subject Joining Parquet & PostgreSQL
Date Thu, 15 Nov 2018 17:56:34 GMT

Hi all,

I’m exploring the idea of adding a foreign data wrapper (FDW) that will let PostgreSQL
read Parquet-format files.

I have just a few questions for now:

1) I have created a few sample Parquet data files using AWS Glue.  Glue split my CSV input
into many (48) smaller xxx.snappy.parquet files, each about 30MB. When I open one of these
files using gparquet_arrow_file_reader_new_path(), I can then call gparquet_arrow_file_reader_read_table()
(and then access the content of the table).  However, …_read_table() seems to read the entire
file into memory all at once (I say that based on the amount of time it takes for gparquet_arrow_file_reader_read_table()
to return).   That’s not the behavior I need.

I have tried to use garrow_memory_mapped_input_stream_new() to open the file, followed by
garrow_record_batch_stream_reader_new().  The call to garrow_record_batch_stream_reader_new()
fails with the message:

[record-batch-stream-reader][open]: Invalid: Expected to read 827474256 metadata bytes, but
only read 30284162

Does this error occur because Glue split the input data?  Or because Glue compressed the data
using snappy?  Do I need to uncompress before I can read/open the file?  Do I need to merge
the files before I can open/read the data?

2) If I use garrow_record_batch_stream_reader_new() instead of gparquet_arrow_file_reader_new_path(),
will I avoid the overhead of reading the entire file into memory before I fetch the first row?
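
A note on the number in the error from question 1: 827474256 is exactly what the four ASCII bytes "PAR1" (the magic bytes at the start of a Parquet file) give when read as a little-endian 32-bit integer. That suggests, though it is only an inference from the number, that the stream reader took the Parquet header for an Arrow stream length prefix, rather than the failure being caused by the Glue split or by snappy. Below is a minimal C sketch of that arithmetic; the helper names le32_from_bytes() and has_parquet_magic() are hypothetical and not part of Arrow GLib.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: interpret four bytes as a little-endian 32-bit
 * integer, the way a reader expecting a 4-byte length prefix would.
 * (That the stream reader does this is an assumption, not confirmed.) */
static int32_t le32_from_bytes(const unsigned char b[4])
{
    uint32_t v = (uint32_t)b[0]
               | ((uint32_t)b[1] << 8)
               | ((uint32_t)b[2] << 16)
               | ((uint32_t)b[3] << 24);
    return (int32_t)v;
}

/* Hypothetical helper: returns 1 if the four bytes are the Parquet
 * magic "PAR1", 0 otherwise. */
static int has_parquet_magic(const unsigned char b[4])
{
    return memcmp(b, "PAR1", 4) == 0;
}
```

Calling le32_from_bytes() on the first four bytes of one of the xxx.snappy.parquet files should reproduce the 827474256 from the error message if those bytes are "PAR1", and has_parquet_magic() gives a quick way to confirm that before choosing a reader.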

Thanks in advance for help and any advice.  

            — Korry