From: Kouhei Sutou <kou@clear-code.com>
Date: Fri, 16 Nov 2018 12:33:17 +0900 (JST)
To: user@arrow.apache.org
Subject: Re: Joining Parquet & PostgreSQL
Message-Id: <20181116.123317.2259580748704882243.kou@clear-code.com>
In-Reply-To: <1E5D30AE-80FB-41DC-93DD-EA2261C852CB@me.com>

Hi,

We haven't implemented the record batch reader feature for Parquet in the C API yet. It's easy to implement, so we can provide the feature in the next release. Could you open a JIRA issue for this feature? You can find the "Create" button at

  https://issues.apache.org/jira/projects/ARROW/issues/

If you can use the C++ API, you can use the feature with the current release.

Thanks,
--
kou

In <1E5D30AE-80FB-41DC-93DD-EA2261C852CB@me.com>
  "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
  Korry Douglas wrote:

> Hi all, I'm exploring the idea of adding a foreign data wrapper (FDW) that will let PostgreSQL read Parquet-format files.
>
> I have just a few questions for now:
>
> 1) I have created a few sample Parquet data files using AWS Glue. Glue split my CSV input into many (48) smaller xxx.snappy.parquet files, each about 30MB.
> When I open one of these files using gparquet_arrow_file_reader_new_path(), I can then call gparquet_arrow_file_reader_read_table() (and then access the content of the table). However, …_read_table() seems to read the entire file into memory all at once (I say that based on the amount of time it takes for gparquet_arrow_file_reader_read_table() to return). That's not the behavior I need.
>
> I have tried to use garrow_memory_mapped_input_stream_new() to open the file, followed by garrow_record_batch_stream_reader_new(). The call to garrow_record_batch_stream_reader_new() fails with the message:
>
>   [record-batch-stream-reader][open]: Invalid: Expected to read 827474256 metadata bytes, but only read 30284162
>
> Does this error occur because Glue split the input data? Or because Glue compressed the data using Snappy? Do I need to uncompress the data before I can read/open the file? Do I need to merge the files before I can open/read the data?
>
> 2) If I use garrow_record_batch_stream_reader_new() instead of gparquet_arrow_file_reader_new_path(), will I avoid the overhead of reading the entire file into memory before I fetch the first row?
>
> Thanks in advance for help and any advice.
>
> ― Korry