From: Daniël Heres <danielheres@gmail.com>
Date: Sat, 12 Dec 2020 14:09:47 +0100
Subject: Re: [Rust] DataFusion performance
To: user@arrow.apache.org

Hello Matthew,

If you want to get the absolute best performance you can from DataFusion right now:

* Make sure you are using the latest version from master; there have been a lot of improvements lately.
* Compile DataFusion with the "simd" feature on. This requires a recent version of DataFusion, but it gives speedups for some computations.
* Compile your code with lto = true, like this in your Cargo.toml file:

[profile.release]
lto = true

This will increase the compile time considerably, but it allows Rust/LLVM to do more optimizations across the entire program. There are some other settings documented here: https://doc.rust-lang.org/cargo/reference/profiles.html#release
* Set the environment variable RUSTFLAGS="-C target-cpu=native". This allows Rust/LLVM to use all CPU instructions available on your CPU, though the binary is no longer portable that way.

We are also improving performance over time: many parts of Arrow / DataFusion have been improved in the last months, such as a faster CSV reader and faster computations, and there is still a lot to come.

Best,

Daniël

On Sat, 12 Dec 2020 at 05:06, Jorge Cardoso Leitão <jorgecarleitao@gmail.com> wrote:

> Hi Matthew,
>
> SchemaRef is just an alias for Arc<Schema>. Thus, you need to wrap it in an Arc.
>
> We do this because the plans are often passed across thread boundaries, and wrapping them in an Arc allows that.
>
> Best,
> Jorge
>
> On Fri, Dec 11, 2020 at 8:14 PM Matthew Turner <matthew.m.turner@outlook.com> wrote:
>
>> Thanks! Converting the schema to owned made it work.
>>
>> The type of the schema param is SchemaRef – which I thought would allow a reference. Is this not the case?
>>
>> Matthew M. Turner
>> Email: matthew.m.turner@outlook.com
>> Phone: (908)-868-2786
>>
>> From: Andy Grove
>> Sent: Friday, December 11, 2020 10:16 AM
>> To: user@arrow.apache.org
>> Subject: Re: [Rust] DataFusion performance
>>
>> Hi Matthew,
>>
>> Using the latest DataFusion from the GitHub master branch, the following code works for in-memory:
>>
>> ```
>> use std::sync::Arc;
>> use std::time::Instant;
>>
>> use datafusion::error::Result;
>> use datafusion::prelude::*;
>> use datafusion::datasource::MemTable;
>>
>> #[tokio::main]
>> async fn main() -> Result<()> {
>>     // TODO add command-line args
>>     let ratings_csv = "/tmp/movies/ratings_small.csv";
>>     let mut ctx = ExecutionContext::new();
>>     let df = ctx.read_csv(ratings_csv, CsvReadOptions::new()).unwrap();
>>     let batches = vec![df.collect().await?];
>>     let provider = MemTable::new(Arc::new(df.schema().to_owned().into()), batches)?;
>>     ctx.register_table("memory_table", Box::new(provider));
>>     let mem_df = ctx.table("memory_table")?;
>>     let q_start = Instant::now();
>>     let _results = mem_df
>>         .filter(col("userId").eq(lit(1)))?
>>         .collect()
>>         .await
>>         .unwrap();
>>     println!("Duration: {:?}", q_start.elapsed());
>>     Ok(())
>> }
>> ```
>>
>> Andy.
>>
>> On Fri, Dec 11, 2020 at 7:59 AM Matthew Turner <matthew.m.turner@outlook.com> wrote:
>>
>> Played around some more - it was because I wasn't using the --release flag. Sorry about that, still learning Rust.
>>
>> Using that flag, the total time to read and filter is between 52 and 80ms.
>>
>> In general, what should I expect when comparing the performance of pandas to DataFusion?
>>
>> @Andy Grove thanks for adding that.
>> If there is a need for additional DataFusion benchmarks and what I do could help with that, then I would be happy to contribute it. I will send a follow-up once I've made progress.
>>
>> I'm also still having trouble with that memory table, so any help there is appreciated.
>>
>> Thanks for your time! Very excited by this.
>>
>> Matthew M. Turner
>> Email: matthew.m.turner@outlook.com
>> Phone: (908)-868-2786
>>
>> -----Original Message-----
>> From: Matthew Turner
>> Sent: Friday, December 11, 2020 12:24 AM
>> To: user@arrow.apache.org
>> Subject: RE: [Rust] DataFusion performance
>>
>> Thanks for the context! Makes sense.
>>
>> Even with that, when comparing the total time of each (read + filter), DataFusion still appears much slower (~625ms vs 33ms). Is that expected?
>>
>> Also, I'm trying to bring the table into memory now to perform the computation from there and compare performance. Code below. But I'm getting an error (beneath the code) even though I think I've constructed the MemTable correctly (from [1]). From what I see, all the types are the same as when I used the original df from read_csv, so I'm not sure what I'm doing wrong.
>>
>> I also saw there was an open issue [2] for this error type raised on rust-lang - so I'm unsure if it's my implementation, a datafusion/arrow issue, or a Rust issue.
>>
>> Thanks again for the help!
>>
>> ```
>> let sch = Arc::new(df.schema());
>> let batches = vec![df.collect().await?];
>> let provider = MemTable::new(sch, batches)?;
>>
>> ctx.register_table("memory_table", Box::new(provider));
>>
>> let mem_df = ctx.table("memory_table")?;
>>
>> let q_start = Instant::now();
>> let results = mem_df
>>     .filter(col("userId").eq(lit(1)))?
>>     .collect()
>>     .await
>>     .unwrap();
>> ```
>>
>> Which is returning this error:
>>
>> error[E0698]: type inside `async` block must be known in this context
>>   --> src\main.rs:37:38
>>    |
>> 37 |         .filter(col("userId").eq(lit(1)))?
>>    |                                      ^ cannot infer type for type `{integer}`
>>    |
>> note: the type is part of the `async` block because of this `await`
>>   --> src\main.rs:36:19
>>    |
>> 36 |       let results = mem_df
>>    |  ___________________^
>> 37 | |         .filter(col("userId").eq(lit(1)))?
>> 38 | |         .collect()
>> 39 | |         .await
>>    | |______________^
>>
>> [1] https://github.com/apache/arrow/blob/master/rust/datafusion/examples/dataframe_in_memory.rs
>> [2] https://github.com/rust-lang/rust/issues/63502
>>
>> Matthew M. Turner
>> Email: matthew.m.turner@outlook.com
>> Phone: (908)-868-2786
>>
>> -----Original Message-----
>> From: Michael Mior
>> Sent: Thursday, December 10, 2020 8:55 PM
>> To: user@arrow.apache.org
>> Subject: Re: [Rust] DataFusion performance
>>
>> Contrary to what you might expect given the name, read_csv does not actually read the CSV file. It instead creates the start of a logical execution plan which involves reading the CSV file when that plan is finally executed. This happens when you call collect().
>>
>> Pandas' read_csv, on the other hand, immediately reads the CSV file. So you're comparing the time of reading AND filtering the file (DataFusion) with the time to filter data which has already been read (pandas).
>>
>> There's nothing wrong with your use of DataFusion per se; you simply weren't measuring what you thought you were measuring.
>> --
>> Michael Mior
>> mmior@apache.org
>>
>> On Thu, 10 Dec 2020 at 17:11, Matthew Turner <matthew.m.turner@outlook.com> wrote:
>> >
>> > Hello,
>> >
>> > I've been playing around with DataFusion to explore the feasibility of replacing current python/pandas data processing jobs with Rust/DataFusion. Ultimately, I'm looking to improve performance / decrease cost.
>> >
>> > I was doing some simple tests to start to measure performance differences on a simple task (read a csv[1] and filter it).
>> >
>> > Reading the csv, DataFusion seemed to outperform pandas by around 30%, which was nice.
>> >
>> > * Rust took around 20-25ms to read the csv (compared to 32ms from pandas)
>> >
>> > However, when filtering the data I was surprised to see that pandas was way faster.
>> >
>> > * Rust took around 500-600ms to filter the csv (compared to 1ms from pandas)
>> >
>> > My code for each is below. I know I should be running the DataFusion times through something similar to Python's %timeit, but I didn't have that immediately accessible, and I ran it many times to confirm it was roughly consistent.
>> >
>> > Is this performance expected? Or am I using DataFusion incorrectly?
>> >
>> > Any insight is much appreciated!
>> >
>> > [Rust]
>> > ```
>> > use datafusion::error::Result;
>> > use datafusion::prelude::*;
>> > use std::time::Instant;
>> >
>> > #[tokio::main]
>> > async fn main() -> Result<()> {
>> >     let start = Instant::now();
>> >
>> >     let mut ctx = ExecutionContext::new();
>> >
>> >     let ratings_csv = "ratings_small.csv";
>> >
>> >     let df = ctx.read_csv(ratings_csv, CsvReadOptions::new()).unwrap();
>> >     println!("Read CSV Duration: {:?}", start.elapsed());
>> >
>> >     let q_start = Instant::now();
>> >     let results = df
>> >         .filter(col("userId").eq(lit(1)))?
>> >         .collect()
>> >         .await
>> >         .unwrap();
>> >     println!("Filter duration: {:?}", q_start.elapsed());
>> >
>> >     println!("Duration: {:?}", start.elapsed());
>> >
>> >     Ok(())
>> > }
>> > ```
>> >
>> > [Python]
>> > ```
>> > In [1]: df = pd.read_csv("ratings_small.csv")
>> > 32.4 ms ± 210 µs per loop (mean ± std. dev.
>> > of 7 runs, 10 loops each)
>> >
>> > In [2]: df.query("userId==1")
>> > 1.16 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>> > ```
>> >
>> > [1]: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv
>> >
>> > Matthew M. Turner
>> > Email: matthew.m.turner@outlook.com
>> > Phone: (908)-868-2786

--
Daniël Heres
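For quick reference, Daniël's build tips from the top of the thread collected into one sketch of a Cargo.toml (the git path and feature name are as described in the thread, current at the end of 2020; check the project's docs for the current equivalents):

```toml
[profile.release]
# Whole-program LTO: considerably slower compiles, faster binaries.
lto = true

[dependencies]
# The "simd" feature needs a recent DataFusion (at the time, git master).
datafusion = { git = "https://github.com/apache/arrow", features = ["simd"] }
```

Then build with `RUSTFLAGS="-C target-cpu=native" cargo build --release` so LLVM may use every instruction your CPU supports; the resulting binary is no longer portable across CPUs.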