From: Daniël Heres <danielheres@gmail.com>
Date: Sat, 12 Dec 2020 14:09:47 +0100
Subject: Re: [Rust] DataFusion performance
To: user@arrow.apache.org

Hello Matthew,

If you want to get the absolute best performance you can from DataFusion right now:

* Make sure you are using the latest version from master; there have been a lot of improvements lately.
* Compile DataFusion with the "simd" feature on. This requires a recent version of DataFusion, but it gives speedups for some computations.
* Compile your code with lto = true, like this in your Cargo.toml file:

[profile.release]
lto = true

This will increase the compile time considerably, but it allows Rust/LLVM to do more optimizations across the entire program. There are some other settings documented here: https://doc.rust-lang.org/cargo/reference/profiles.html#release
* Set the environment variable RUSTFLAGS="-C target-cpu=native". This allows Rust/LLVM to use all CPU instructions available on your CPU, though the binary is no longer portable that way.

We are also improving performance over time: many parts of Arrow / DataFusion have been improved in the last months, such as a faster CSV reader and faster computations, and there is still a lot to come.

Best,

Daniël

On Sat, 12 Dec 2020 at 05:06, Jorge Cardoso Leitão <jorgecarleitao@gmail.com> wrote:

> Hi Matthew,
>
> SchemaRef is just an alias for Arc<Schema>. Thus, you need to wrap it in an Arc.
>
> We do this because the plans are often passed across thread boundaries, and wrapping them in an Arc allows that.
>
> Best,
> Jorge
>
> On Fri, Dec 11, 2020 at 8:14 PM Matthew Turner <matthew.m.turner@outlook.com> wrote:
>
>> Thanks! Converting the schema to owned made it work.
>>
>> The type of the schema param is SchemaRef – which I thought would allow a reference. Is this not the case?
>>
>> Matthew M. Turner
>> Email: matthew.m.turner@outlook.com
>> Phone: (908)-868-2786
>>
>> From: Andy Grove
>> Sent: Friday, December 11, 2020 10:16 AM
>> To: user@arrow.apache.org
>> Subject: Re: [Rust] DataFusion performance
>>
>> Hi Matthew,
>>
>> Using the latest DataFusion from the GitHub master branch, the following code works for in-memory:
>>
>> ```
>> use std::sync::Arc;
>> use std::time::Instant;
>>
>> use datafusion::error::Result;
>> use datafusion::prelude::*;
>> use datafusion::datasource::MemTable;
>>
>> #[tokio::main]
>> async fn main() -> Result<()> {
>>     // TODO add command-line args
>>     let ratings_csv = "/tmp/movies/ratings_small.csv";
>>     let mut ctx = ExecutionContext::new();
>>     let df = ctx.read_csv(ratings_csv, CsvReadOptions::new()).unwrap();
>>     let batches = vec![df.collect().await?];
>>     let provider = MemTable::new(Arc::new(df.schema().to_owned().into()), batches)?;
>>     ctx.register_table("memory_table", Box::new(provider));
>>     let mem_df = ctx.table("memory_table")?;
>>     let q_start = Instant::now();
>>     let _results = mem_df
>>         .filter(col("userId").eq(lit(1)))?
>>         .collect()
>>         .await
>>         .unwrap();
>>     println!("Duration: {:?}", q_start.elapsed());
>>     Ok(())
>> }
>> ```
>>
>> Andy.
>>
>> On Fri, Dec 11, 2020 at 7:59 AM Matthew Turner <matthew.m.turner@outlook.com> wrote:
>>
>> Played around some more - it was because I wasn't using the --release flag. Sorry about that, still learning Rust.
>>
>> Using that flag, the total time to read and filter is between 52 and 80ms.
>>
>> In general, what should I expect when comparing the performance of pandas to DataFusion?
>>
>> @Andy Grove thanks for adding that.
>> If there is a need for additional DataFusion benchmarks and what I do could help with that, then I would be happy to contribute it. I will send a follow-up once I've made progress.
>>
>> I'm also still having trouble with that memory table, so any help there is appreciated.
>>
>> Thanks for your time! Very excited by this.
>>
>> Matthew M. Turner
>> Email: matthew.m.turner@outlook.com
>> Phone: (908)-868-2786
>>
>> -----Original Message-----
>> From: Matthew Turner
>> Sent: Friday, December 11, 2020 12:24 AM
>> To: user@arrow.apache.org
>> Subject: RE: [Rust] DataFusion performance
>>
>> Thanks for the context! Makes sense.
>>
>> Even with that, when comparing the total time of each (read + filter), DataFusion still appears much slower (~625ms vs 33ms). Is that expected?
>>
>> Also, I'm trying to bring the table into memory now to perform the computation from there and compare performance. Code below. But I'm getting an error (beneath the code) even though I think I've constructed the MemTable correctly (from [1]). From what I see, all the types are the same as when I used the original df from read_csv, so I'm not sure what I'm doing wrong.
>>
>> I also saw there was an open issue [2] for this error type raised on rust-lang - so I'm unsure if it's my implementation, a datafusion/arrow issue, or a Rust issue.
>>
>> Thanks again for the help!
>>
>> ```
>> let sch = Arc::new(df.schema());
>> let batches = vec![df.collect().await?];
>> let provider = MemTable::new(sch, batches)?;
>>
>> ctx.register_table("memory_table", Box::new(provider));
>>
>> let mem_df = ctx.table("memory_table")?;
>>
>> let q_start = Instant::now();
>> let results = mem_df
>>     .filter(col("userId").eq(lit(1)))?
>>     .collect()
>>     .await
>>     .unwrap();
>> ```
>>
>> Which is returning this error:
>>
>> error[E0698]: type inside `async` block must be known in this context
>>   --> src\main.rs:37:38
>>    |
>> 37 |         .filter(col("userId").eq(lit(1)))?
>>    |                                      ^ cannot infer type for type `{integer}`
>>    |
>> note: the type is part of the `async` block because of this `await`
>>   --> src\main.rs:36:19
>>    |
>> 36 |       let results = mem_df
>>    |  ___________________^
>> 37 | |         .filter(col("userId").eq(lit(1)))?
>> 38 | |         .collect()
>> 39 | |         .await
>>    | |______________^
>>
>> [1] https://github.com/apache/arrow/blob/master/rust/datafusion/examples/dataframe_in_memory.rs
>> [2] https://github.com/rust-lang/rust/issues/63502
>>
>> Matthew M. Turner
>> Email: matthew.m.turner@outlook.com
>> Phone: (908)-868-2786
>>
>> -----Original Message-----
>> From: Michael Mior
>> Sent: Thursday, December 10, 2020 8:55 PM
>> To: user@arrow.apache.org
>> Subject: Re: [Rust] DataFusion performance
>>
>> Contrary to what you might expect given the name, read_csv does not actually read the CSV file. It instead creates the start of a logical execution plan which involves reading the CSV file when that plan is finally executed. This happens when you call collect().
>>
>> Pandas' read_csv, on the other hand, immediately reads the CSV file. So you're comparing the time of reading AND filtering the file (DataFusion) with the time to filter data which has already been read (pandas).
>>
>> There's nothing wrong with your use of DataFusion per se; you simply weren't measuring what you thought you were measuring.
>> --
>> Michael Mior
>> mmior@apache.org
>>
>> On Thu, 10 Dec 2020 at 17:11, Matthew Turner <matthew.m.turner@outlook.com> wrote:
>> >
>> > Hello,
>> >
>> > I've been playing around with DataFusion to explore the feasibility of replacing current python/pandas data processing jobs with Rust/DataFusion. Ultimately, I'm looking to improve performance / decrease cost.
>> >
>> > I was doing some simple tests to start to measure performance differences on a simple task (read a csv[1] and filter it).
>> >
>> > Reading the csv, DataFusion seemed to outperform pandas by around 30%, which was nice.
>> >
>> > * Rust took around 20-25ms to read the csv (compared to 32ms from pandas)
>> >
>> > However, when filtering the data I was surprised to see that pandas was way faster.
>> >
>> > * Rust took around 500-600ms to filter the csv (compared to 1ms from pandas)
>> >
>> > My code for each is below. I know I should be running the DataFusion times through something similar to Python's %timeit, but I didn't have that immediately accessible, and I ran it many times to confirm it was roughly consistent.
>> >
>> > Is this performance expected? Or am I using DataFusion incorrectly?
>> >
>> > Any insight is much appreciated!
>> >
>> > [Rust]
>> > ```
>> > use datafusion::error::Result;
>> > use datafusion::prelude::*;
>> > use std::time::Instant;
>> >
>> > #[tokio::main]
>> > async fn main() -> Result<()> {
>> >     let start = Instant::now();
>> >
>> >     let mut ctx = ExecutionContext::new();
>> >
>> >     let ratings_csv = "ratings_small.csv";
>> >
>> >     let df = ctx.read_csv(ratings_csv, CsvReadOptions::new()).unwrap();
>> >     println!("Read CSV Duration: {:?}", start.elapsed());
>> >
>> >     let q_start = Instant::now();
>> >     let results = df
>> >         .filter(col("userId").eq(lit(1)))?
>> >         .collect()
>> >         .await
>> >         .unwrap();
>> >     println!("Filter duration: {:?}", q_start.elapsed());
>> >
>> >     println!("Duration: {:?}", start.elapsed());
>> >
>> >     Ok(())
>> > }
>> > ```
>> >
>> > [Python]
>> > ```
>> > In [1]: df = pd.read_csv("ratings_small.csv")
>> > 32.4 ms ± 210 µs per loop (mean ± std. dev.
>> > of 7 runs, 10 loops each)
>> >
>> > In [2]: df.query("userId==1")
>> > 1.16 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>> > ```
>> >
>> > [1]: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv
>> >
>> > Matthew M. Turner
>> > Email: matthew.m.turner@outlook.com
>> > Phone: (908)-868-2786

--
Daniël Heres
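For quick reference, Daniël's build tips from the top of the thread collected into one sketch of a Cargo.toml (the git path and feature name are as described in the thread, current at the end of 2020; check the project's docs for the current equivalents):

```toml
[profile.release]
# Whole-program LTO: considerably slower compiles, faster binaries.
lto = true

[dependencies]
# The "simd" feature needs a recent DataFusion (at the time, git master).
datafusion = { git = "https://github.com/apache/arrow", features = ["simd"] }
```

Then build with `RUSTFLAGS="-C target-cpu=native" cargo build --release` so LLVM may use every instruction your CPU supports; the resulting binary is no longer portable across CPUs.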