From: Kouhei Sutou <kou@clear-code.com>
Date: Fri, 16 Nov 2018 12:33:17 +0900 (JST)
To: user@arrow.apache.org
Subject: Re: Joining Parquet & PostgreSQL
Message-Id: <20181116.123317.2259580748704882243.kou@clear-code.com>
In-Reply-To: <1E5D30AE-80FB-41DC-93DD-EA2261C852CB@me.com>

Hi,

We haven't implemented the record batch reader feature for Parquet in the C API yet. It's easy to implement, so we can provide the feature in the next release. Could you open a JIRA issue for this feature? You can find the "Create" button at

  https://issues.apache.org/jira/projects/ARROW/issues/

If you can use the C++ API, you can use the feature with the current release.

Thanks,
--
kou

In <1E5D30AE-80FB-41DC-93DD-EA2261C852CB@me.com>
  "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
  Korry Douglas wrote:

> Hi all, I'm exploring the idea of adding a foreign data wrapper (FDW) that will let PostgreSQL read Parquet-format files.
>
> I have just a few questions for now:
>
> 1) I have created a few sample Parquet data files using AWS Glue. Glue split my CSV input into many (48) smaller xxx.snappy.parquet files, each about 30MB.
> When I open one of these files using gparquet_arrow_file_reader_new_path(), I can then call gparquet_arrow_file_reader_read_table() (and then access the content of the table). However, …_read_table() seems to read the entire file into memory all at once (I say that based on the amount of time it takes for gparquet_arrow_file_reader_read_table() to return). That's not the behavior I need.
>
> I have tried to use garrow_memory_mapped_input_stream_new() to open the file, followed by garrow_record_batch_stream_reader_new(). The call to garrow_record_batch_stream_reader_new() fails with the message:
>
>   [record-batch-stream-reader][open]: Invalid: Expected to read 827474256 metadata bytes, but only read 30284162
>
> Does this error occur because Glue split the input data? Or because Glue compressed the data using Snappy? Do I need to uncompress the data before I can read/open the file? Do I need to merge the files before I can open/read the data?
>
> 2) If I use garrow_record_batch_stream_reader_new() instead of gparquet_arrow_file_reader_new_path(), will I avoid the overhead of reading the entire file into memory before I fetch the first row?
>
> Thanks in advance for help and any advice.
>
> ― Korry