arrow-user mailing list archives

From Matthew Turner <matthew.m.tur...@outlook.com>
Subject [Rust] DataFusion performance
Date Thu, 10 Dec 2020 22:11:29 GMT
Hello,

I've been playing around with DataFusion to explore the feasibility of replacing our current
Python/pandas data processing jobs with Rust/DataFusion.  Ultimately, I'm looking to improve
performance and decrease cost.

I ran some simple tests to start measuring performance differences on a basic task
(read a CSV[1] and filter it).

When reading the CSV, DataFusion seemed to outperform pandas by around 30%, which was nice.
* Rust took around 20-25ms to read the CSV (compared to 32ms for pandas)

However, when filtering the data I was surprised to see that pandas was far faster.
* Rust took around 500-600ms to filter the CSV (compared to about 1ms for pandas)

My code for each is below.  I know I should be running the DataFusion timings through something
similar to Python's %timeit, but I didn't have that immediately accessible, so I ran the program
many times to confirm the numbers were roughly consistent.
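
For reference, a %timeit-style measurement can be approximated with the standard library alone. The following is only a minimal sketch: `TimingStats` and `time_it` are hypothetical names made up for illustration, not part of DataFusion or any crate, and the helper only handles synchronous closures.

```rust
use std::time::{Duration, Instant};

/// Summary of repeated timings, loosely modelled on IPython's `%timeit` output.
/// (Hypothetical type for illustration only.)
#[derive(Debug)]
struct TimingStats {
    runs: usize,
    mean: Duration,
    min: Duration,
    max: Duration,
}

/// Calls `f` once as an untimed warm-up, then `runs` more times,
/// timing each of those calls individually.
fn time_it<F: FnMut()>(runs: usize, mut f: F) -> TimingStats {
    assert!(runs > 0, "runs must be at least 1");

    // Warm-up call: not included in the statistics.
    f();

    let mut samples = Vec::with_capacity(runs);
    for _ in 0..runs {
        let start = Instant::now();
        f();
        samples.push(start.elapsed());
    }

    let total: Duration = samples.iter().sum();
    TimingStats {
        runs,
        mean: total / runs as u32,
        min: *samples.iter().min().unwrap(),
        max: *samples.iter().max().unwrap(),
    }
}
```

Calling something like `time_it(100, || { /* work to measure */ })` and printing the result with `{:?}` gives a mean/min/max over the timed runs. An async call such as `collect().await` would not fit this closure signature as-is; it would need the timing loop written inline inside the async function, or an async variant of the helper.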

Is this performance expected? Or am I using datafusion incorrectly?

Any insight is much appreciated!

[Rust]
```
use datafusion::error::Result;
use datafusion::prelude::*;
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<()> {
    // Overall timer for the whole program.
    let start = Instant::now();

    let mut ctx = ExecutionContext::new();

    let ratings_csv = "ratings_small.csv";

    // Time the read_csv call; propagate errors with `?` instead of panicking.
    let df = ctx.read_csv(ratings_csv, CsvReadOptions::new())?;
    println!("Read CSV Duration: {:?}", start.elapsed());

    // Time the filter plus collect.
    let q_start = Instant::now();
    let results = df
        .filter(col("userId").eq(lit(1)))?
        .collect()
        .await?;
    println!("Filter duration: {:?}", q_start.elapsed());

    println!("Duration: {:?}", start.elapsed());

    Ok(())
}
```

[Python]
```
In [1]: df = pd.read_csv("ratings_small.csv")
32.4 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [2]: df.query("userId==1")
1.16 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

[1]: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv


Matthew M. Turner
Email: matthew.m.turner@outlook.com
Phone: (908)-868-2786

