MIME-Version: 1.0
References:
In-Reply-To:
From: Andrew Lamb
Date: Sun, 24 Jan 2021 07:01:20 -0500
Message-ID:
Subject: Re: [RUST] Reading parquet
To: user@arrow.apache.org
Content-Type: multipart/alternative; boundary="0000000000009850aa05b9a42ed2"

--0000000000009850aa05b9a42ed2
Content-Type: text/plain; charset="UTF-8"

Hi Fernando,

Keeping the data in memory as `RecordBatch`es sounds like the way to go if
you want it all to be in memory.

Another way to work in Rust with data from parquet files is to use the
`DataFusion` library. Depending on your needs, it might save you some time
building up your analytics (e.g. it has aggregations, filtering and sorting
built in).

Here are some examples of how to use DataFusion with a parquet file (with
the dataframe and the SQL api):

https://github.com/apache/arrow/blob/master/rust/datafusion/examples/dataframe.rs
https://github.com/apache/arrow/blob/master/rust/datafusion/examples/parquet_sql.rs

If you already have RecordBatches, you can register an in-memory table as
well.

Hope that helps,
Andrew

On Sat, Jan 23, 2021 at 7:33 AM Fernando Herrera <fernando.j.herrera@gmail.com> wrote:

> Hi all,
>
> A quick question regarding reading a parquet file. What is the best way to
> read a parquet file and keep it in memory to do data analysis?
>
> What I'm doing now is using the record reader from the
> ParquetFileArrowReader and then I read all the record batches from the
> file. I keep the batches in memory in a vector of record batches. This way
> I have access to them to do some aggregations I need from the file.
>
> Is there another way to do this?
>
> Thanks,
> Fernando
>

--0000000000009850aa05b9a42ed2
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Fernando,

Keeping the data in memory as `RecordBatch`es sounds like the way to go if you want it all to be in memory.
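
As a rough illustration of that approach, here is a minimal plain-Rust sketch of aggregating over a vector of in-memory batches. It deliberately does not use the arrow crate: `Batch`, its `price` column, and the helper functions are hypothetical stand-ins for `RecordBatch` and whatever columns your file actually has, so treat it as the shape of the loop rather than working Arrow code.

```rust
// Hypothetical stand-in for arrow's `RecordBatch`: one chunk of rows holding
// a single f64 column, where `None` models a null value. This is NOT the
// arrow API, only an illustration of the "vector of batches" pattern.
#[derive(Debug, Clone)]
struct Batch {
    price: Vec<Option<f64>>,
}

/// Total number of rows across all batches (nulls included).
fn total_rows(batches: &[Batch]) -> usize {
    batches.iter().map(|b| b.price.len()).sum()
}

/// Sum of the non-null values across all batches.
fn sum_price(batches: &[Batch]) -> f64 {
    batches
        .iter()
        .flat_map(|b| b.price.iter().flatten())
        .sum()
}

/// Mean of the non-null values, or `None` when every value is null
/// or there are no rows at all.
fn mean_price(batches: &[Batch]) -> Option<f64> {
    let count = batches
        .iter()
        .flat_map(|b| b.price.iter().flatten())
        .count();
    if count == 0 {
        None
    } else {
        Some(sum_price(batches) / count as f64)
    }
}

fn main() {
    // In the real program these would be the batches read from the file.
    let batches = vec![
        Batch { price: vec![Some(1.0), Some(2.0), None] },
        Batch { price: vec![Some(3.0)] },
    ];

    println!("rows = {}", total_rows(&batches));
    println!("sum  = {}", sum_price(&batches));
    println!("mean = {:?}", mean_price(&batches));
}
```

The point of the sketch is that each aggregation is a single pass over every batch, so the vector never has to be concatenated into one large array first.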

Another way to work in Rust with data from parquet files is to use the `DataFusion` library. Depending on your needs, it might save you some time building up your analytics (e.g. it has aggregations, filtering and sorting built in).

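Here are some examples of how to use DataFusion with a parquet file (with the dataframe and the SQL api):

https://github.com/apache/arrow/blob/master/rust/datafusion/examples/dataframe.rs
https://github.com/apache/arrow/blob/master/rust/datafusion/examples/parquet_sql.rs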

If you already have RecordBatches, you can register an in-memory table as well.

Hope that helps,
Andrew

On Sat, Jan 23, 2021 at 7:33 AM Fernando Herrera <fernando.j.herrera@gmail.com> wrote:
Hi all,

A quick question regarding reading a parquet file. What is the best way to read a parquet file and keep it in memory to do data analysis?

What I'm doing now is using the record reader from the ParquetFileArrowReader and then I read all the record batches from the file. I keep the batches in memory in a vector of record batches. This way I have access to them to do some aggregations I need from the file.

Is there another way to do this?

Thanks,
Fernando
--0000000000009850aa05b9a42ed2--