arrow-user mailing list archives

From Andy Grove <andygrov...@gmail.com>
Subject Re: [Rust] DataFusion performance
Date Fri, 11 Dec 2020 05:48:57 GMT
Oops, I managed to hit a magical key combination that sent the mail
prematurely. Let's try that again.

Hi Matthew,

I went ahead and created a PR to add this to our benchmark suite. I will
aim to finish this over the weekend.

https://github.com/apache/arrow/pull/8896

If I run in debug mode with "cargo run --bin movies" I get a timing of 526
ms, which seems similar to the timing you are seeing.

If I run in release mode with "cargo run --release --bin movies" then the
time drops down to 21 ms.

Are you running in release mode?
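
In case it helps when comparing numbers, here is a minimal sketch of a
%timeit-style helper that also reports which kind of build is running. It
uses only the standard library. `time_it` and `build_profile` are
hypothetical names for illustration and are not part of DataFusion, and the
closure in `main` is a stand-in workload rather than the actual CSV query.
It also assumes Cargo's default profiles, where debug assertions are on in
debug builds and off in release builds.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Reports the build kind, assuming Cargo's default profiles
/// (debug assertions enabled in debug builds, disabled with --release).
fn build_profile() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

/// Hypothetical helper: runs `f` `runs` times and returns the
/// (mean, best) wall-clock duration of a single run.
fn time_it<T>(runs: u32, mut f: impl FnMut() -> T) -> (Duration, Duration) {
    assert!(runs > 0, "runs must be at least 1");
    let mut total = Duration::ZERO;
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        let out = f();
        let elapsed = start.elapsed();
        // Pass the result through black_box so the optimizer cannot
        // discard the work being measured.
        black_box(out);
        total += elapsed;
        best = best.min(elapsed);
    }
    (total / runs, best)
}

fn main() {
    println!("build profile: {}", build_profile());

    // Stand-in workload: filter an in-memory column of integers.
    let data: Vec<u64> = (0..1_000_000).collect();
    let (mean, best) = time_it(10, || data.iter().filter(|&&x| x % 671 == 1).count());
    println!("mean {:?}, best {:?} over 10 runs", mean, best);
}
```

Running this once with "cargo run" and once with "cargo run --release" should
show the profile label change, and usually a large gap in the timings too.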

Thanks,

Andy.


On Thu, Dec 10, 2020 at 10:46 PM Andy Grove <andygrove73@gmail.com> wrote:

> Hi Matthew,
>
> I went ahead and created a PR to add this to our benchmark suite. I will
> aim to finish this over the weekend.
>
>
> On Thu, Dec 10, 2020 at 3:11 PM Matthew Turner <
> matthew.m.turner@outlook.com> wrote:
>
>> Hello,
>>
>> I’ve been playing around with DataFusion to explore the feasibility of
>> replacing our current Python/pandas data processing jobs with
>> Rust/DataFusion. Ultimately, I’m looking to improve performance and
>> decrease cost.
>>
>> I was doing some simple tests to start measuring performance differences
>> on a simple task (read a CSV[1] and filter it).
>>
>> When reading the CSV, DataFusion seemed to outperform pandas by around
>> 30%, which was nice.
>>
>> * Rust took around 20-25 ms to read the CSV (compared to 32 ms for pandas)
>>
>> However, when filtering the data I was surprised to see that pandas was
>> far faster.
>>
>> * Rust took around 500-600 ms to filter the CSV (compared to 1 ms for pandas)
>>
>> My code for each is below. I know I should be running the DataFusion
>> timings through something similar to Python’s %timeit, but I didn’t have
>> that immediately accessible, so I ran the program many times to confirm
>> the results were roughly consistent.
>>
>> Is this performance expected? Or am I using DataFusion incorrectly?
>>
>> Any insight is much appreciated!
>>
>> [Rust]
>>
>> ```
>> use datafusion::error::Result;
>> use datafusion::prelude::*;
>> use std::time::Instant;
>>
>> #[tokio::main]
>> async fn main() -> Result<()> {
>>     let start = Instant::now();
>>
>>     let mut ctx = ExecutionContext::new();
>>
>>     let ratings_csv = "ratings_small.csv";
>>
>>     let df = ctx.read_csv(ratings_csv, CsvReadOptions::new()).unwrap();
>>     println!("Read CSV Duration: {:?}", start.elapsed());
>>
>>     let q_start = Instant::now();
>>     let results = df
>>         .filter(col("userId").eq(lit(1)))?
>>         .collect()
>>         .await
>>         .unwrap();
>>     println!("Filter duration: {:?}", q_start.elapsed());
>>
>>     println!("Duration: {:?}", start.elapsed());
>>
>>     Ok(())
>> }
>> ```
>>
>> [Python]
>>
>> ```
>> In [1]: df = pd.read_csv("ratings_small.csv")
>> 32.4 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>
>> In [2]: df.query("userId==1")
>> 1.16 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>> ```
>>
>> [1]: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv
>>
>> *Matthew M. Turner*
>> Email: matthew.m.turner@outlook.com
>> Phone: (908)-868-2786
>>
>
