Subject: Learning pyarrow and optimize row groups size
From: jonathan mercier
To: user@arrow.apache.org
Date: Wed, 18 Mar 2020 11:30:41 +0100

Dear all,

I am learning the pyarrow API and Arrow technology, so I would first like to thank you for your work.

From my understanding, pyarrow.Array and pyarrow.RecordBatch are write-once structures: we cannot append data to them.

1/ Is that correct?

I wrote a little script to write data into a Parquet file. The data is a 2D list (a list of rows, each of which is a list of column values: [['a','b','c'], ['d','e','f']]).

The script is here: https://gist.github.com/bioinfornatics/c82398fa22339d34f41b3580c988c308

To achieve this, I keep all the intermediate pyarrow structures in memory in order to create a table (a schema and a list of pyarrow arrays).

2/ Is it possible to reach the same goal with a stream, so as not to waste memory and to be able to handle terabytes of data?

I read these interesting articles: https://www.dremio.com/tuning-parquet/ and https://parquet.apache.org/documentation/latest/, which recommend large row groups (512MB - 1GB).

3/ How can I manage row groups so that each one is approximately 1GB in size?

4/ When using pyarrow, should the data ultimately be stored on disk as a Parquet file, or does pyarrow provide its own generic file format as a common data layer?

Thanks a lot for your help and your work on Arrow.

Best regards,
Jonathan
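
To make question 2 concrete, here is a minimal sketch of what a streaming write could look like, assuming pyarrow.parquet.ParquetWriter is used to append one table per chunk of rows. The names iter_column_chunks, write_parquet_streaming and rows_per_group are hypothetical and do not come from the gist, and every column is assumed to hold strings, as in the example data above. The sketch does not target a byte size by itself: rows_per_group would have to be chosen from an estimate of the row size (question 3).

```python
from itertools import islice


def iter_column_chunks(rows, rows_per_group):
    """Yield chunks of at most rows_per_group rows, transposed to columns."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, rows_per_group))
        if not chunk:
            return
        # Transpose [[a, b, c], [d, e, f]] into [[a, d], [b, e], [c, f]].
        yield [list(col) for col in zip(*chunk)]


def write_parquet_streaming(rows, names, path, rows_per_group):
    """Write rows to a Parquet file chunk by chunk (hypothetical helper)."""
    # Imported lazily so the chunking helper above works without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Assumption: every column holds strings, as in the example data.
    schema = pa.schema([(name, pa.string()) for name in names])
    with pq.ParquetWriter(path, schema) as writer:
        for columns in iter_column_chunks(rows, rows_per_group):
            arrays = [pa.array(col, type=pa.string()) for col in columns]
            table = pa.Table.from_arrays(arrays, schema=schema)
            # Only the current chunk is held in memory while it is written.
            writer.write_table(table)
```

With this shape, rows can be any iterator (for example a generator reading from a file), so the full data set never has to be materialized as a single table.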