cassandra-commits mailing list archives

From "Stefania (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance
Date Wed, 17 Feb 2016 10:32:18 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150253#comment-15150253 ]

Stefania commented on CASSANDRA-11053:
--------------------------------------

Here are the latest results:

||MODULE CYTHONIZED||PREPARED STATEMENTS||NUM. WORKER PROCESSES||CHUNK SIZE||AVERAGE ROWS / SEC||TOTAL TIME||
|DRIVER|YES|7|5,000|97,146|3' 31"|
|DRIVER|YES|8|5,000|103,037|3' 19"|
|DRIVER|YES|9|5,000|104,070|3' 17"|
|DRIVER|YES|10|5,000|*104,498*|3' 16"|
|DRIVER COPYUTIL|YES|7|5,000|89,123|3' 48"|
|DRIVER COPYUTIL|YES|8|5,000|107,897|3' 10"|
|DRIVER COPYUTIL|YES|9|5,000|*109,871*|3' 7"|
|DRIVER COPYUTIL|YES|10|5,000|109,616|3' 8"|
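
To make the two halves of the table easier to compare, the relative change from also cythonizing copyutil can be computed per worker count. This is only a minimal sketch: the figures are copied from the table above, it assumes the first column lists which modules were cythonized, and the names {{RESULTS}} and {{copyutil_gain}} are illustrative, not part of cqlsh.

```python
# Average rows/sec copied from the table above, keyed by number of worker processes.
RESULTS = {
    "DRIVER": {7: 97146, 8: 103037, 9: 104070, 10: 104498},
    "DRIVER COPYUTIL": {7: 89123, 8: 107897, 9: 109871, 10: 109616},
}


def copyutil_gain(results):
    """Return the percentage change in rows/sec when copyutil is cythonized as well."""
    base = results["DRIVER"]
    both = results["DRIVER COPYUTIL"]
    return {n: round(100.0 * (both[n] - base[n]) / base[n], 1) for n in sorted(base)}


if __name__ == "__main__":
    for workers, gain in copyutil_gain(RESULTS).items():
        print(f"{workers} workers: {gain:+.1f}%")
```

On these numbers the extra module only pays off from 8 workers upward; with 7 workers the run is slower than with the driver alone.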

In addition to using separate pipes as mentioned above, I've found one more optimization and
I've calibrated how much data the parent process sends to the worker processes. Two default
parameters have changed: the max ingest rate is now 150k rows per second and the report
frequency has dropped from 4 times per second to 2. I've run cqlsh with {{SCHED_BATCH}} CPU
scheduling ({{schedtool -B -e ./bin/cqlsh}}), which helps a little (maybe 2-3k rows/second),
and I've changed the clock source from {{xen}} to {{tsc}} (I'm unsure whether this helps, but
it doesn't hurt).
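
For anyone wanting to repeat a run with these settings, the invocation can be assembled roughly as follows. This is a sketch under stated assumptions: it assumes the settings map to the COPY options {{NUMPROCESSES}}, {{CHUNKSIZE}}, {{INGESTRATE}} and {{REPORTFREQUENCY}}, that {{REPORTFREQUENCY}} is expressed in seconds (so two reports per second is 0.5), and that the statement is passed with cqlsh's {{-e}} flag. The table name, CSV path and the helper {{build_copy_command}} are hypothetical.

```python
import shlex


def build_copy_command(table, csv_path, num_processes=9, chunk_size=5000,
                       ingest_rate=150000, report_frequency=0.5):
    """Build a schedtool + cqlsh argument list for a COPY FROM run."""
    statement = (
        f"COPY {table} FROM '{csv_path}' WITH NUMPROCESSES={num_processes} "
        f"AND CHUNKSIZE={chunk_size} AND INGESTRATE={ingest_rate} "
        f"AND REPORTFREQUENCY={report_frequency}"
    )
    # schedtool -B selects SCHED_BATCH and its -e runs the command that follows;
    # the second -e belongs to cqlsh and executes the given statement.
    return ["schedtool", "-B", "-e", "./bin/cqlsh", "-e", statement]


if __name__ == "__main__":
    # Hypothetical table and file names, for illustration only.
    argv = build_copy_command("ks.large_table", "/tmp/large.csv")
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The list form can be handed directly to {{subprocess.run}}; the printed form is only for copying into a shell.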

I would like to repeat the tests on an AWS instance with twice the number of cores, to see
how far we can scale. I've already verified that if we halve the number of cores (by fixing
the affinity to only 4 cores) the throughput also halves. I'm thinking of testing on
C4.4xlarge. So far I've used R3.2xlarge, but we don't need all that memory, so I would like
to try a C4 instance instead.
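
The comment does not say how the affinity was fixed ({{taskset}} would be one option). As an illustration only, the same restriction can be applied from Python on Linux before the test is launched; {{pick_cores}} and {{restrict_current_process}} are hypothetical helper names, and {{os.sched_setaffinity}} is not available on every platform.

```python
import os


def pick_cores(available, count):
    """Return the `count` lowest-numbered core ids from `available`."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return set(sorted(available)[:count])


def restrict_current_process(count):
    """Pin this process, and any children it spawns afterwards, to `count` cores.

    Linux only: relies on os.sched_getaffinity and os.sched_setaffinity.
    """
    cores = pick_cores(os.sched_getaffinity(0), count)
    os.sched_setaffinity(0, cores)
    return cores


if __name__ == "__main__":
    # Halve an 8-core machine to 4 cores, as in the experiment described above.
    print(restrict_current_process(4))
```

Child processes inherit the affinity mask, so worker processes started after the call would be confined to the same cores.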

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed two issues:
> * The progress report is incorrect: it advances very slowly until almost the end of the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with a smaller cluster locally (approx 35,000 rows per second). As a comparison, cassandra-stress manages 50,000 rows per second under the same set-up, which makes it roughly 1.4 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
