Subject: Re: How to make a parquet dataset from an input file through Random access
To: user@arrow.apache.org
From: Jonathan MERCIER
Date: Thu, 21 Jan 2021 13:19:07 +0100
In-Reply-To: <0ea88c7faf7a3da88e864517bc239ff3685ccaae.camel@cnrgh.fr>

Same question, but simpler to understand.

I am using pyarrow and processing the data in pieces, one piece per process (multiprocessing as a workaround for the GIL limitation).

What is the correct way to handle this task?

1. Each parallel process creates a list of records, stores them in a record batch, and returns this batch.
2. Each parallel process creates an output buffer and a stream writer, creates a list of records, stores them in a record batch, and writes this record batch to the stream writer. The process then returns the corresponding buffer.

With option (1) I see how to merge all of those batches, but with option (2), how do I merge all the buffers into one once each process has returned its buffer?

Thanks

-- 
Jonathan MERCIER
Researcher computational biology, PhD
Centre National de Recherche en Génomique Humaine (CNRGH)
Bioinformatics (LBI)
2, rue Gaston Crémieux
91057 Evry Cedex
Tel: (33) 1 60 87 34 88
Email: jonathan.mercier@cnrgh.fr