From: Andy Grove <andygrove73@gmail.com>
Date: Thu, 10 Dec 2020 22:48:57 -0700
Subject: Re: [Rust] DataFusion performance
To: user@arrow.apache.org

Oops, I managed to hit a magical key combination that sent the mail prematurely. Let's try that again.

Hi Matthew,

I went ahead and created a PR to add this to our benchmark suite. I will aim to finish this over the weekend.

https://github.com/apache/arrow/pull/8896

If I run in debug mode with "cargo run --bin movies" I get a timing of 526 ms, which seems similar to the timing you are seeing.

If I run in release mode with "cargo run --release --bin movies" then the time drops down to 21 ms.

Are you running in release mode?

Thanks,

Andy.

On Thu, Dec 10, 2020 at 10:46 PM Andy Grove wrote:

> Hi Matthew,
>
> I went ahead and created a PR to add this to our benchmark suite. I will
> aim to finish this over the weekend.
>
> On Thu, Dec 10, 2020 at 3:11 PM Matthew Turner
> <matthew.m.turner@outlook.com> wrote:
>
>> Hello,
>>
>> I’ve been playing around with DataFusion to explore the feasibility of
>> replacing current Python/pandas data processing jobs with Rust/DataFusion.
>> Ultimately, I am looking to improve performance and decrease cost.
>>
>> I was doing some simple tests to start measuring performance differences
>> on a simple task (read a CSV [1] and filter it).
>>
>> When reading the CSV, DataFusion seemed to outperform pandas by around
>> 30%, which was nice.
>>
>> * Rust took around 20-25 ms to read the CSV (compared to 32 ms for pandas)
>>
>> However, when filtering the data I was surprised to see that pandas was
>> way faster.
>>
>> * Rust took around 500-600 ms to filter the CSV (compared to 1 ms for pandas)
>>
>> My code for each is below. I know I should be running the DataFusion
>> times through something similar to Python's %timeit, but I didn’t have that
>> immediately accessible, and I ran many times to confirm it was roughly
>> consistent.
>>
>> Is this performance expected? Or am I using DataFusion incorrectly?
>>
>> Any insight is much appreciated!
>>
>> [Rust]
>> ```
>> use datafusion::error::Result;
>> use datafusion::prelude::*;
>> use std::time::Instant;
>>
>> #[tokio::main]
>> async fn main() -> Result<()> {
>>     let start = Instant::now();
>>
>>     let mut ctx = ExecutionContext::new();
>>
>>     let ratings_csv = "ratings_small.csv";
>>
>>     let df = ctx.read_csv(ratings_csv, CsvReadOptions::new()).unwrap();
>>     println!("Read CSV Duration: {:?}", start.elapsed());
>>
>>     let q_start = Instant::now();
>>     let results = df
>>         .filter(col("userId").eq(lit(1)))?
>>         .collect()
>>         .await
>>         .unwrap();
>>     println!("Filter duration: {:?}", q_start.elapsed());
>>
>>     println!("Duration: {:?}", start.elapsed());
>>
>>     Ok(())
>> }
>> ```
>>
>> [Python]
>> ```
>> In [1]: df = pd.read_csv("ratings_small.csv")
>> 32.4 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>
>> In [2]: df.query("userId==1")
>> 1.16 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>> ```
>>
>> [1]: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv
>>
>> Matthew M. Turner
>> Email: matthew.m.turner@outlook.com
>> Phone: (908)-868-2786
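
On the %timeit point raised in the original message, here is a minimal sketch of a repeated-timing helper that uses only the Rust standard library. The names `time_it` and `TimingStats` are hypothetical and not part of DataFusion, and the workload is a plain in-memory stand-in for the `userId == 1` filter, not the DataFusion query itself. It also prints which build profile it ran under, since the thread reports 526 ms in debug mode versus 21 ms in release mode for the same benchmark.

```rust
use std::time::{Duration, Instant};

/// Summary of repeated timings, loosely modeled on IPython's %timeit output.
/// Hypothetical helper type; not part of DataFusion or the standard library.
#[derive(Debug)]
struct TimingStats {
    runs: usize,
    mean: Duration,
    min: Duration,
    max: Duration,
}

/// Run `work` `runs` times and summarize the wall-clock durations.
fn time_it<T>(runs: usize, mut work: impl FnMut() -> T) -> TimingStats {
    assert!(runs > 0, "need at least one run");
    let mut samples = Vec::with_capacity(runs);
    for _ in 0..runs {
        let start = Instant::now();
        // black_box discourages the optimizer from discarding the result.
        std::hint::black_box(work());
        samples.push(start.elapsed());
    }
    let total: Duration = samples.iter().sum();
    TimingStats {
        runs,
        mean: total / runs as u32,
        min: *samples.iter().min().unwrap(),
        max: *samples.iter().max().unwrap(),
    }
}

fn main() {
    // Stand-in workload (an assumption for illustration): count rows where
    // the user id equals 1 in an in-memory column.
    let user_ids: Vec<u32> = (0..100_000).map(|i| i % 700 + 1).collect();
    let stats = time_it(100, || user_ids.iter().filter(|&&id| id == 1).count());

    // debug_assertions is on by default for `cargo run` and off by default
    // for `cargo run --release`.
    let profile = if cfg!(debug_assertions) { "debug" } else { "release" };
    println!("[{profile}] {stats:?}");
}
```

Running it once with `cargo run` and once with `cargo run --release` prints the profile label next to the statistics, which makes an accidental debug-mode measurement easy to spot.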