arrow-user mailing list archives

From Andy Grove <andygrov...@gmail.com>
Subject Re: [Rust] DataFusion performance
Date Fri, 11 Dec 2020 05:46:23 GMT
Hi Matthew,

I went ahead and created a PR to add this to our benchmark suite. I will
aim to finish this over the weekend.


On Thu, Dec 10, 2020 at 3:11 PM Matthew Turner <matthew.m.turner@outlook.com>
wrote:

> Hello,
>
> I’ve been playing around with DataFusion to explore the feasibility of
> replacing our current Python/pandas data processing jobs with
> Rust/DataFusion. Ultimately, I’m looking to improve performance and
> decrease cost.
>
> I was doing some simple tests to start measuring performance differences
> on a simple task (read a CSV file [1] and filter it).
>
> When reading the CSV, DataFusion seemed to outperform pandas by around
> 30%, which was nice.
>
> * Rust took around 20-25 ms to read the CSV (compared to 32 ms for pandas)
>
> However, when filtering the data I was surprised to see that pandas was
> far faster.
>
> * Rust took around 500-600 ms to filter the CSV (compared to 1 ms for pandas)
>
> My code for each is below. I know I should be running the DataFusion
> timings through something similar to Python’s %timeit, but I didn’t have
> that immediately accessible, so I ran the program many times to confirm
> that the results were roughly consistent.
>
> Is this performance expected? Or am I using DataFusion incorrectly?
>
> Any insight is much appreciated!
>
> [Rust]
>
> ```
> use datafusion::error::Result;
> use datafusion::prelude::*;
> use std::time::Instant;
>
> #[tokio::main]
> async fn main() -> Result<()> {
>     let start = Instant::now();
>
>     let mut ctx = ExecutionContext::new();
>
>     let ratings_csv = "ratings_small.csv";
>
>     let df = ctx.read_csv(ratings_csv, CsvReadOptions::new()).unwrap();
>     println!("Read CSV Duration: {:?}", start.elapsed());
>
>     let q_start = Instant::now();
>     let results = df
>         .filter(col("userId").eq(lit(1)))?
>         .collect()
>         .await
>         .unwrap();
>     println!("Filter duration: {:?}", q_start.elapsed());
>
>     println!("Duration: {:?}", start.elapsed());
>
>     Ok(())
> }
> ```
>
> [Python]
>
> ```
> In [1]: df = pd.read_csv("ratings_small.csv")
> 32.4 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [2]: df.query("userId==1")
> 1.16 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> ```
>
> [1]: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv
>
> Matthew M. Turner
> Email: matthew.m.turner@outlook.com
> Phone: (908)-868-2786
>
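
On the timing point in the quoted message (no %timeit equivalent to hand on the Rust side), here is a minimal sketch of a repeat-and-average timer using only the standard library. The `Timing` struct and `time_it` helper are hypothetical names invented for this illustration and are not part of DataFusion; the closure in `main` is a stand-in workload, not the query from the message.

```rust
use std::time::{Duration, Instant};

/// Mean and standard deviation of the per-call time across runs.
#[derive(Debug)]
struct Timing {
    mean: Duration,
    std_dev: Duration,
}

/// Hypothetical helper (not a DataFusion API): call `f` `loops` times per
/// run, for `runs` runs, and summarise the per-call time across the runs.
fn time_it<F: FnMut()>(runs: u32, loops: u32, mut f: F) -> Timing {
    assert!(runs > 0 && loops > 0, "runs and loops must be non-zero");

    // One entry per run: average seconds per call within that run.
    let mut per_call = Vec::with_capacity(runs as usize);
    for _ in 0..runs {
        let start = Instant::now();
        for _ in 0..loops {
            f();
        }
        per_call.push(start.elapsed().as_secs_f64() / loops as f64);
    }

    let mean = per_call.iter().sum::<f64>() / runs as f64;
    // Population variance of the per-run averages.
    let variance = per_call
        .iter()
        .map(|t| (t - mean).powi(2))
        .sum::<f64>()
        / runs as f64;

    Timing {
        mean: Duration::from_secs_f64(mean),
        std_dev: Duration::from_secs_f64(variance.sqrt()),
    }
}

fn main() {
    // Stand-in workload; replace the closure body with the code under test.
    let data: Vec<u64> = (0..100_000).collect();
    let t = time_it(7, 10, || {
        let n = data.iter().filter(|&&x| x % 2 == 0).count();
        // Keep the optimizer from discarding the unused result.
        std::hint::black_box(n);
    });
    println!(
        "{:?} ± {:?} per loop (mean ± std. dev. of 7 runs, 10 loops each)",
        t.mean, t.std_dev
    );
}
```

To time an async query this way, the closure would have to drive the future to completion itself, for example by blocking on it with the runtime. Building in release mode (`cargo run --release`) also matters for any comparison like this one.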
