cassandra-commits mailing list archives

From "Stefania (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance
Date Fri, 05 Feb 2016 09:33:40 GMT


Stefania commented on CASSANDRA-11053:

I've made some small optimizations and cythonized the copyutil module in pylib. I've
also experimented with non-prepared statements, since we spend most of the time parsing data
and binding parameters.
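Cythonizing a pure-Python module like copyutil comes down to a small build step. A minimal sketch, assuming a setuptools layout (the actual pylib build hooks may differ):

```python
# Minimal Cython build sketch (assumed layout, not the actual pylib
# build): compiles copyutil.py into a C extension so the hot
# parsing/binding loops avoid Python interpreter overhead.
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="cqlshlib-copyutil",
    ext_modules=cythonize("copyutil.py"),
)
```

Building in place (e.g. `python setup.py build_ext --inplace`) produces a compiled extension that shadows the pure-Python module at import time.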

Here are the results for the 1KB test:

||Cythonized modules||Prepared statements||Rows per second||Total time||
|None|Yes|39,100| 8' 43''|
|None|No|50,900| 6' 42''|
|Driver|Yes|64,300| 5' 18''|
|Driver|No|77,000| 4' 25''|
|Driver + copyutil|Yes|70,700| 4' 49''|
|Driver + copyutil|No|87,300| 3' 54''|
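As a sanity check on the table above, the rate times the elapsed time should recover roughly the same row count for every configuration (the benchmark dataset is about 20M rows; the rates are rounded, so the products only approximate it):

```python
# Consistency check for the benchmark table: rows/sec x elapsed seconds
# should give ~20M rows for every row of the table.
results = [
    ("None",              True,  39_100, (8, 43)),
    ("None",              False, 50_900, (6, 42)),
    ("Driver",            True,  64_300, (5, 18)),
    ("Driver",            False, 77_000, (4, 25)),
    ("Driver + copyutil", True,  70_700, (4, 49)),
    ("Driver + copyutil", False, 87_300, (3, 54)),
]

for modules, prepared, rate, (minutes, seconds) in results:
    elapsed = minutes * 60 + seconds
    rows = rate * elapsed
    # Rounded rates mean the product only approximates the row count.
    assert abs(rows - 20_000_000) / 20_000_000 < 0.03, (modules, rows)
```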

Please note that the non-prepared statements code still needs cleaning up; specifically, I
need to add a check for missing primary key values, so it might slow down slightly. Non-prepared
statements are faster in this set-up because the cluster is oversized. They may be terrible
in other set-ups with smaller clusters: not only do they move all the parsing to the Cassandra
nodes, they also force each batch statement to be recompiled. I will add a flag to allow using
non-prepared statements, but the default will remain prepared statements.
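The trade-off can be illustrated without a cluster. In this sketch (illustrative only, not the actual copyutil code; `format_value` and the table/column names are hypothetical), a prepared statement keeps one constant query string that the server parses and compiles once, while a non-prepared batch inlines literal values, so every batch is a distinct string the server must reparse:

```python
# Illustrative sketch of the two statement forms (not the actual
# copyutil implementation).

def format_value(v):
    # Hypothetical CQL literal formatter: strings quoted and escaped,
    # everything else rendered as-is.
    return "'%s'" % v.replace("'", "''") if isinstance(v, str) else str(v)

# Prepared form: a single fixed string, parsed once server-side;
# only the bound values travel with each execution.
PREPARED = "INSERT INTO ks.t (pk, val) VALUES (?, ?)"

def non_prepared(rows):
    # Non-prepared form: one statement string per batch. Parsing moves
    # to the Cassandra nodes, and no two batches share a compiled form.
    return "BEGIN UNLOGGED BATCH\n" + "".join(
        "INSERT INTO ks.t (pk, val) VALUES (%s, %s);\n"
        % (format_value(pk), format_value(val))
        for pk, val in rows
    ) + "APPLY BATCH;"

batch_a = non_prepared([(1, "a"), (2, "b")])
batch_b = non_prepared([(3, "c"), (4, "d")])
assert batch_a != batch_b  # every batch is a fresh string to compile
```

With an oversized cluster the extra server-side parsing is absorbed by idle capacity, which is why non-prepared statements win in this benchmark but may lose badly on smaller clusters.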

We also still have an issue with real-time reporting: the faster the performance gets, the
less accurate the real-time reporting is. I need to address this.
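One way to decouple reporting accuracy from ingest speed is to accumulate counts on every update but refresh the display at most once per time interval. A hypothetical sketch (not the actual cqlsh patch):

```python
import time

class ProgressReporter:
    # Hypothetical time-throttled reporter: row counts are accumulated
    # on every update, but a report line is produced at most once per
    # interval, so accuracy no longer depends on how fast rows arrive.

    def __init__(self, interval=0.25, clock=time.monotonic):
        self.interval = interval
        self.clock = clock          # injectable for testing
        self.rows = 0
        self.last_time = clock()
        self.last_rows = 0

    def update(self, num_rows):
        self.rows += num_rows
        now = self.clock()
        elapsed = now - self.last_time
        if elapsed < self.interval:
            return None             # too soon: just accumulate
        rate = (self.rows - self.last_rows) / elapsed
        self.last_time, self.last_rows = now, self.rows
        return "Processed %d rows; %.0f rows/s" % (self.rows, rate)
```

With an injectable clock the throttling is easy to verify: two updates inside the interval produce one report whose count includes both.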

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>                 Key: CASSANDRA-11053
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt,
parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
> Running COPY FROM on a large dataset (20 GB divided into 20M records) revealed two issues:
> * The progress report is incorrect: it is very slow until almost the end of the test,
at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with a smaller
cluster locally (approx. 35,000 rows per second). As a comparison, cassandra-stress manages
50,000 rows per second under the same set-up, making it roughly 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.

This message was sent by Atlassian JIRA
