From: "Uwe L. Korn"
To: user@arrow.apache.org
Date: Fri, 16 Nov 2018 16:27:21 +0100
Subject: Re: Joining Parquet & PostgreSQL
Message-Id: <1542382041.1935824.1579332064.7F4A91CE@webmail.messagingengine.com>

Hello Korry,

the C (GLib) API calls the C++ functions in the background, so it is only
another layer on top. The parquet::arrow C++ API is built in a way that it
does not use C++ exceptions. Instead, if there is a failure, we return
arrow::Status objects indicating it.

Uwe

On Fri, Nov 16, 2018, at 3:27 PM, Korry Douglas wrote:
> Thanks Kouhei and Wes for the fast response, much appreciated.
>
> C++ is a bit troublesome for me because of the difference between
> PostgreSQL exception handling (setjmp/longjmp) and C++ exception
> handling (throw/catch) - I’m worried that destructors might not get
> invoked properly when cleaning up errors in Postgres.
>
> I’ve found very few examples on the web that demonstrate how to use the
> Parquet C or C++ APIs. Are you aware of any projects that I might look
> into to understand how to use the APIs? Any blogs that might be
> helpful?
>
> — Korry
>
> > On Nov 16, 2018, at 8:41 AM, Wes McKinney wrote:
> >
> > That will work, but the size of a single row group could be very large
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L176
> >
> > This function also appears to have a bug in it.
> > If any column is a
> > ChunkedArray after calling ReadRowGroup, then the call to
> > TableBatchReader::ReadNext will return only part of the row group
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L200
> >
> > I opened https://issues.apache.org/jira/browse/ARROW-3822
> > On Thu, Nov 15, 2018 at 11:23 PM Kouhei Sutou wrote:
> >>
> >> Hi,
> >>
> >> I think that we can use
> >> parquet::arrow::FileReader::GetRecordBatchReader()
> >> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L175
> >> for this purpose.
> >>
> >> It doesn't read the specified number of rows, but it'll read
> >> only the rows in each row group.
> >> (Do I misunderstand?)
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In
> >> "Re: Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 22:41:13 -0500,
> >> Wes McKinney wrote:
> >>
> >>> garrow_record_batch_stream_reader_new() is for reading files that use
> >>> the stream IPC protocol described in
> >>> https://github.com/apache/arrow/blob/master/format/IPC.md, not for
> >>> Parquet files
> >>>
> >>> We don't have a streaming reader implemented yet for Parquet files.
> >>> The relevant JIRA (a bit thin on detail) is
> >>> https://issues.apache.org/jira/browse/ARROW-1012. To be clear, I mean
> >>> to implement this interface, with the option to read some number of
> >>> "rows" at a time:
> >>>
> >>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166
> >>> On Thu, Nov 15, 2018 at 10:33 PM Kouhei Sutou wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> We didn't implement the record batch reader feature for Parquet
> >>>> in the C API yet. It's easy to implement, so we can provide the
> >>>> feature in the next release. Can you open a JIRA issue for
> >>>> this feature?
> >>>> You can find the "Create" button at
> >>>> https://issues.apache.org/jira/projects/ARROW/issues/
> >>>>
> >>>> If you can use the C++ API, you can use the feature with the
> >>>> current release.
> >>>>
> >>>> Thanks,
> >>>> --
> >>>> kou
> >>>>
> >>>> In <1E5D30AE-80FB-41DC-93DD-EA2261C852CB@me.com>
> >>>> "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
> >>>> Korry Douglas wrote:
> >>>>
> >>>>> Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW) that will let PostgreSQL read Parquet-format files.
> >>>>>
> >>>>> I have just a few questions for now:
> >>>>>
> >>>>> 1) I have created a few sample Parquet data files using AWS Glue. Glue split my CSV input into many (48) smaller xxx.snappy.parquet files, each about 30MB. When I open one of these files using gparquet_arrow_file_reader_new_path(), I can then call gparquet_arrow_file_reader_read_table() (and then access the content of the table). However, …_read_table() seems to read the entire file into memory all at once (I say that based on the amount of time it takes for gparquet_arrow_file_reader_read_table() to return). That’s not the behavior I need.
> >>>>>
> >>>>> I have tried to use garrow_memory_mapped_input_stream_new() to open the file, followed by garrow_record_batch_stream_reader_new(). The call to garrow_record_batch_stream_reader_new() fails with the message:
> >>>>>
> >>>>> [record-batch-stream-reader][open]: Invalid: Expected to read 827474256 metadata bytes, but only read 30284162
> >>>>>
> >>>>> Does this error occur because Glue split the input data? Or because Glue compressed the data using snappy? Do I need to uncompress before I can read/open the file? Do I need to merge the files before I can open/read the data?
> >>>>>
> >>>>> 2) If I use garrow_record_batch_stream_reader_new() instead of gparquet_arrow_file_reader_new_path(), will I avoid the overhead of reading the entire file into memory before I fetch the first row?
> >>>>>
> >>>>> Thanks in advance for help and any advice.
> >>>>>
> >>>>> ― Korry
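[Editor's sketch] The Status-based error handling Uwe describes, combined with the per-row-group reading Wes and Kouhei discuss, might look roughly like the following in C++. This is an untested sketch against the parquet::arrow API of that era (Arrow ~0.11); the function name ReadByRowGroup is made up for illustration, and ReadableFile::Open, parquet::arrow::OpenFile, and ARROW_RETURN_NOT_OK are standard Arrow C++ entry points assumed here. Because every failure is reported through an arrow::Status return value rather than a thrown exception, no destructor is skipped by stack unwinding, which addresses Korry's setjmp/longjmp concern.

```cpp
#include <memory>
#include <string>

#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>

// Hypothetical helper: read a Parquet file one row group at a time.
// Every failure comes back as an arrow::Status; nothing throws.
arrow::Status ReadByRowGroup(const std::string& path) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  // One Table per row group, so memory use is bounded by the row-group
  // size (Wes's caveat: a single row group can still be very large).
  for (int i = 0; i < reader->num_row_groups(); ++i) {
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadRowGroup(i, &table));
    // ... hand `table` to the FDW layer here ...
  }
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = ReadByRowGroup("part-00000.snappy.parquet");
  if (!st.ok()) {
    // Safe to translate st.message() into a PostgreSQL ereport() here:
    // no C++ exception is in flight, so no destructors are skipped.
    return 1;
  }
  return 0;
}
```

Note that this still materializes a whole row group at once; the streaming reader with a caller-chosen batch size is exactly what ARROW-1012 tracks.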