From: "Uwe L. Korn"
To: user@arrow.apache.org
Date: Fri, 16 Nov 2018 16:27:21 +0100
Subject: Re: Joining Parquet & PostgreSQL
Message-Id: <1542382041.1935824.1579332064.7F4A91CE@webmail.messagingengine.com>

Hello Korry,

the C (GLib) API calls the C++ functions in the background, so it is only
another layer on top. The parquet::arrow C++ API is built in a way that it
does not use C++ exceptions. Instead, if there is a failure, we return
arrow::Status objects indicating it.

Uwe

On Fri, Nov 16, 2018, at 3:27 PM, Korry Douglas wrote:
> Thanks Kouhei and Wes for the fast response, much appreciated.
>
> C++ is a bit troublesome for me because of the difference between
> PostgreSQL exception handling (setjmp/longjmp) and C++ exception
> handling (throw/catch) - I’m worried that destructors might not get
> invoked properly when cleaning up errors in Postgres.
>
> I’ve found very few examples on the web that demonstrate how to use the
> Parquet C or C++ APIs. Are you aware of any projects that I might look
> into to understand how to use the APIs? Any blogs that might be
> helpful?
>
> — Korry
>
> > On Nov 16, 2018, at 8:41 AM, Wes McKinney wrote:
> >
> > That will work, but the size of a single row group could be very large
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L176
> >
> > This function also appears to have a bug in it.
> > If any column is a
> > ChunkedArray after calling ReadRowGroup, then the call to
> > TableBatchReader::ReadNext will return only part of the row group
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L200
> >
> > I opened https://issues.apache.org/jira/browse/ARROW-3822
> > On Thu, Nov 15, 2018 at 11:23 PM Kouhei Sutou wrote:
> >>
> >> Hi,
> >>
> >> I think that we can use
> >> parquet::arrow::FileReader::GetRecordBatchReader()
> >> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L175
> >> for this purpose.
> >>
> >> It doesn't read the specified number of rows, but it'll read
> >> only the rows in each row group.
> >> (Do I misunderstand?)
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In
> >> "Re: Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 22:41:13 -0500,
> >> Wes McKinney wrote:
> >>
> >>> garrow_record_batch_stream_reader_new() is for reading files that use
> >>> the stream IPC protocol described in
> >>> https://github.com/apache/arrow/blob/master/format/IPC.md, not for
> >>> Parquet files
> >>>
> >>> We don't have a streaming reader implemented yet for Parquet files.
> >>> The relevant JIRA (a bit thin on detail) is
> >>> https://issues.apache.org/jira/browse/ARROW-1012. To be clear, I mean
> >>> to implement this interface, with the option to read some number of
> >>> "rows" at a time:
> >>>
> >>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166
> >>> On Thu, Nov 15, 2018 at 10:33 PM Kouhei Sutou wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> We didn't implement the record batch reader feature for Parquet
> >>>> in the C API yet. It's easy to implement, so we can provide the
> >>>> feature in the next release. Can you open a JIRA issue for
> >>>> this feature?
> >>>> You can find the "Create" button at
> >>>> https://issues.apache.org/jira/projects/ARROW/issues/
> >>>>
> >>>> If you can use the C++ API, you can use the feature with the
> >>>> current release.
> >>>>
> >>>> Thanks,
> >>>> --
> >>>> kou
> >>>>
> >>>> In <1E5D30AE-80FB-41DC-93DD-EA2261C852CB@me.com>
> >>>> "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
> >>>> Korry Douglas wrote:
> >>>>
> >>>>> Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW) that will let PostgreSQL read Parquet-format files.
> >>>>>
> >>>>> I have just a few questions for now:
> >>>>>
> >>>>> 1) I have created a few sample Parquet data files using AWS Glue. Glue split my CSV input into many (48) smaller xxx.snappy.parquet files, each about 30MB. When I open one of these files using gparquet_arrow_file_reader_new_path(), I can then call gparquet_arrow_file_reader_read_table() (and then access the content of the table). However, …_read_table() seems to read the entire file into memory all at once (I say that based on the amount of time it takes for gparquet_arrow_file_reader_read_table() to return). That’s not the behavior I need.
> >>>>>
> >>>>> I have tried to use garrow_memory_mapped_input_stream_new() to open the file, followed by garrow_record_batch_stream_reader_new(). The call to garrow_record_batch_stream_reader_new() fails with the message:
> >>>>>
> >>>>> [record-batch-stream-reader][open]: Invalid: Expected to read 827474256 metadata bytes, but only read 30284162
> >>>>>
> >>>>> Does this error occur because Glue split the input data? Or because Glue compressed the data using snappy? Do I need to uncompress before I can read/open the file? Do I need to merge the files before I can open/read the data?
> >>>>>
> >>>>> 2) If I use garrow_record_batch_stream_reader_new() instead of gparquet_arrow_file_reader_new_path(), will I avoid the overhead of reading the entire file into memory before I fetch the first row?
> >>>>>
> >>>>> Thanks in advance for help and any advice.
> >>>>>
> >>>>> ― Korry
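[Editor's sketch] The Status-based error handling Uwe describes, combined with the per-row-group reading Wes and Kouhei discuss, might look roughly like the following in C++. This is an untested sketch against the parquet::arrow API of that era (Arrow ~0.11); the function name ReadByRowGroup is made up for illustration, and ReadableFile::Open, parquet::arrow::OpenFile, and ARROW_RETURN_NOT_OK are standard Arrow C++ entry points assumed here. Because every failure is reported through an arrow::Status return value rather than a thrown exception, no destructor is skipped by stack unwinding, which addresses Korry's setjmp/longjmp concern.

```cpp
#include <memory>
#include <string>

#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>

// Hypothetical helper: read a Parquet file one row group at a time.
// Every failure comes back as an arrow::Status; nothing throws.
arrow::Status ReadByRowGroup(const std::string& path) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  // One Table per row group, so memory use is bounded by the row-group
  // size (Wes's caveat: a single row group can still be very large).
  for (int i = 0; i < reader->num_row_groups(); ++i) {
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadRowGroup(i, &table));
    // ... hand `table` to the FDW layer here ...
  }
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = ReadByRowGroup("part-00000.snappy.parquet");
  if (!st.ok()) {
    // Safe to translate st.message() into a PostgreSQL ereport() here:
    // no C++ exception is in flight, so no destructors are skipped.
    return 1;
  }
  return 0;
}
```

Note that this still materializes a whole row group at once; the streaming reader with a caller-chosen batch size is exactly what ARROW-1012 tracks.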