From: Jacek Pliszka
Date: Mon, 1 Mar 2021 11:31:38 +0100
Subject: Re: why that take so many times to read parquets file with 300 000 columns
To: user@arrow.apache.org

Others will probably give you better hints, but you do not need to convert to Pandas. Read the file in Arrow and convert to NumPy directly if NumPy is what you want.

BR,

Jacek

On Mon, 1 Mar 2021 at 11:24, jonathan mercier wrote:
>
> Dear all,
>
> I am trying to study 300 000 samples of SARS-CoV-2 with parquet/pyarrow,
> so I have a table with 300 000 columns and around 45 000 rows of
> presence/absence (0/1). It is a file of ~150 MB.
>
> I read this file like this:
>
> import numpy
> import pyarrow.parquet as pq
> data = pq.read_table(dataset_path).to_pandas().to_numpy().astype(numpy.bool_)
>
> And this statement takes 1 hour …
> So is there a trick to speed up loading this data into memory?
> Is it possible to distribute the loading with a library such as ray?
>
> Thanks
>
> Best regards
>
>
> --
> Researcher computational biology
> PhD, Jonathan MERCIER
>
> Bioinformatics (LBI)
> 2, rue Gaston Crémieux
> 91057 Evry Cedex
>
>
> Tel :(+33)1 60 87 83 44
> Email :jonathan.mercier@cnrgh.fr