From: Antoine Pitrou
To: user@arrow.apache.org
Subject: Re: [C++] - Squeeze more out of parquet write(table) operation.
Date: Sat, 27 Mar 2021 10:12:27 +0100
Message-ID: <20210327101227.2c80e499@fsol>
References: <07C7E799-407E-439E-A9A4-84738F43AF55@icloud.com>

On Fri, 26 Mar 2021 18:47:26 -1000
Weston Pace wrote:
> I'm fairly certain there is room for improvement in the C++
> implementation for writing single files to ADLFS.  Others can correct
> me if I'm wrong but we don't do any kind of pipelined writes.  I'd
> guess this is partly because there isn't much benefit when writing to
> local disk (writes are typically synchronous) but also because it's
> much easier to write multiple files.

Writes should be asynchronous most of the time.  I don't know anything
about ADLFS, though.

Regards

Antoine.


>
> Is writing multiple files a choice for you?  I would guess using a
> dataset write with multiple files would be significantly more
> efficient than one large single file write on ADLFS.
>
> -Weston
>
> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram wrote:
> >
> > Hello,
> >
> > Thank you again for earlier help on improving overall ADLFS read latency using multiple threads which has worked out really well.
> >
> > I’ve incorporated buffering on the adls/writer implementation (upto 64 meg).
> > What I’m noticing is that the parquet_writer->WriteTable(table) latency dominates everything else on the output phase of the job (~65sec vs ~1.2min). I could use multiple threads (like io/s3fs) but not sure if it will have any effect on parquet write table operation.
> >
> > Question: Is there anything else I can leverage inside parquet/writer subsystem to improve the core parquet/write/table latency?
> >
> >
> > schema:
> >   map>>
> >   struct<...>
> >   map>>>
> >   struct<…>
> >   binary
> > num_row_groups: 6
> > num_rows_per_row_group: ~8mil
> > write buffer size: 64 * 1024 * 1024 (~64 mb)
> > write compression: snappy
> > total write latency per row group: ~1.2min
> > adls append/flush latency (minor factor)
> > Azure: ESv3/RAM: 256Gb/Cores: 8
> >
> > Yesh
>
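
For reference, the "pipelined writes" idea raised in the thread can be sketched roughly as follows. This is only an illustration using the C++ standard library: PipelinedWriter, its Sink callback and WriteAllPipelined are hypothetical names, not part of Arrow, Parquet or any ADLS client. The assumption is that finished buffers are handed to a background thread, so the caller can keep encoding while the previous buffer is still being appended to the remote store.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper (not an Arrow API): queues finished buffers and lets a
// single background thread pass them to the sink in FIFO order.
class PipelinedWriter {
 public:
  using Sink = std::function<void(const std::string&)>;

  PipelinedWriter(Sink sink, std::size_t max_pending)
      : sink_(std::move(sink)), max_pending_(max_pending) {
    assert(max_pending_ > 0);
    worker_ = std::thread([this] { Drain(); });
  }

  ~PipelinedWriter() { Close(); }

  // Blocks only when max_pending buffers are already queued (back-pressure).
  void Write(std::string buffer) {
    std::unique_lock<std::mutex> lock(mutex_);
    assert(!closed_);
    not_full_.wait(lock, [this] { return pending_.size() < max_pending_; });
    pending_.push(std::move(buffer));
    not_empty_.notify_one();
  }

  // Flushes everything still queued, then stops the worker. Safe to call twice.
  void Close() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    not_empty_.notify_one();
    if (worker_.joinable()) worker_.join();
  }

 private:
  void Drain() {
    for (;;) {
      std::string buffer;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return closed_ || !pending_.empty(); });
        if (pending_.empty()) return;  // closed and fully drained
        buffer = std::move(pending_.front());
        pending_.pop();
      }
      not_full_.notify_one();
      sink_(buffer);  // the slow I/O call runs outside the lock
    }
  }

  Sink sink_;
  std::size_t max_pending_;
  std::mutex mutex_;
  std::condition_variable not_empty_, not_full_;
  std::queue<std::string> pending_;
  bool closed_ = false;
  std::thread worker_;  // started last, after all other members exist
};

// Usage sketch: the sink appends to a string here; in a real job it would be
// the remote append/flush call (an assumption, not shown).
std::string WriteAllPipelined(const std::vector<std::string>& buffers) {
  std::string out;
  PipelinedWriter writer([&out](const std::string& b) { out += b; }, 2);
  for (const auto& b : buffers) writer.Write(b);
  writer.Close();  // joins the worker, so reading `out` afterwards is safe
  return out;
}
```

A single worker keeps the appends in order, which matters for a single output file. Whether this helps in practice depends on how much of the ~1.2min per row group is spent encoding rather than in I/O, which the numbers in the thread do not settle.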