cassandra-commits mailing list archives

From "Stefania (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance
Date Fri, 12 Feb 2016 12:02:18 GMT


Stefania commented on CASSANDRA-11053:

Here are the latest results:

||CYTHON||PREPARED STATEMENTS||WORKER PROCESSES||CHUNK SIZE||AVG. ROWS/SEC||TOTAL TIME||OBSERVED RATE (mid-run -> near end)||
|NONE|YES|7|1,000|44,115|7' 44"|43,700 -> 44,000|
|NONE|NO|7|1,000|58,345|5' 51"|57,800 -> 58,200|
|DRIVER|YES|7|1,000|77,719|4' 23"|77,300 -> 77,600|
|DRIVER|NO \(*\)|7|1,000|94,508 \(*\)|3' 36"|94,000 -> 95,000|
|DRIVER|YES|15|1,000|78,429|4' 21"|77,900 -> 78,300|
|DRIVER|YES|7|10,000|78,746|4' 20"|78,000 -> 78,500|
|DRIVER|YES|7|5,000|79,337|4' 18"|78,900 -> 79,200|
|DRIVER|YES|8|5,000|81,636|4' 10"|80,900 -> 81,500|
|DRIVER|YES|9|5,000|*82,584*|4' 8"|82,000 -> 82,500|
|DRIVER|YES|10|5,000|82,486|4' 8"|81,800 -> 82,400|
|DRIVER|YES|9|2,500|82,013|4' 9"|81,500 -> 81,900|
|DRIVER + COPYUTIL|YES|9|5,000|*88,187*|3' 52"|87,900 -> 88,100|
|DRIVER + COPYUTIL|NO \(*\)|9|5,000|87,860 \(*\)|3' 53"|99,600 -> 93,800|

I've also saved the results in a [spreadsheet|].

The rightmost column contains two approximate observations of the real-time rate, taken at about
half-way through and just before finishing. Its purpose is simply to verify that the real-time
rate is now correct: it no longer lags behind as it used to.

The test runs marked with a \(*\) were affected by timeouts, indicating the cluster had reached
capacity. This is to be expected: with non-prepared statements we shift the parsing burden to the
Cassandra nodes, forcing them to compile each batch statement as well. I don't consider this a
particularly good thing to do, since it only pays off when the cluster is over-sized, and I
therefore focused my search for optimal parameters on the case with prepared statements (the
default). In the very last run, we can see how half-way through we had an average of 99,600,
which then plummeted just before finishing due to a long pause (an exponential back-off policy
kicks in on timeouts).
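To illustrate the back-off behavior described above, here is a minimal sketch of retrying a chunk send with exponential back-off. The names {{send_with_backoff}}, {{send_chunk}} and the delay parameters are hypothetical, not the actual cqlsh internals; only the doubling-with-jitter schedule is the point.

```python
import random
import time

def send_with_backoff(send_chunk, chunk, max_retries=5, base_delay=0.5):
    """Retry a chunk send on timeout, doubling the delay each attempt.

    `send_chunk` and `chunk` stand in for the real cqlsh machinery.
    """
    for attempt in range(max_retries):
        try:
            return send_chunk(chunk)
        except TimeoutError:
            # Delay doubles per attempt, with jitter so that worker
            # processes do not all retry at the same instant.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)
    raise TimeoutError("chunk failed after %d retries" % max_retries)
```

A long sequence of such doubling pauses near the end of a run is exactly what produces the rate drop seen in the last row of the table.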

The improvements over the [last set of results|]
are mostly due to targeted optimizations of the Python code, guided by the Python [line profiler|].
I've also reduced the amount of data sent from worker processes to the parent by aggregating
results, which helped the real-time reporting tremendously. In addition, I added support for libev
if it is installed, as described in the driver [installation guide|].
Finally, I fixed a problem with type formatting introduced by the cythonized driver.
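The aggregation idea can be sketched as follows. This is not the actual copyutil code: {{worker_loop}} and the {{(ok, failed)}} summary tuple are hypothetical, but they show the principle of sending one summary per chunk to the parent instead of one message per row.

```python
from multiprocessing import Queue

def worker_loop(rows, results_queue, chunk_size=1000):
    """Accumulate per-row outcomes locally and send one (ok, failed)
    summary per chunk, cutting parent/worker traffic by ~chunk_size x."""
    ok = failed = 0
    for i, row in enumerate(rows, 1):
        ok += 1  # the real insert of `row` would happen here
        if i % chunk_size == 0:
            results_queue.put((ok, failed))
            ok = failed = 0
    if ok or failed:
        # Flush the final, possibly partial, chunk summary.
        results_queue.put((ok, failed))
```

With per-row messages, the parent's receive loop was itself a bottleneck for the progress report; per-chunk summaries keep it cheap regardless of the insert rate.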

With these improvements, together with those previously adopted, worker and parent processes
are no longer as tightly coupled, so I experimented with the number of worker processes and
the chunk size. The default number of worker processes is 7 (num-cores minus 1), but from
observation num-cores + 1 seems to give better results. I monitored vmstats with {{dstat}}
and the number of running tasks stayed reasonable (less than 2 * num-cores). As for the chunk
size, the default value of 1,000 is probably too small; 5,000 seems a better value for this
particular dataset and environment. However, I don't propose changing the current default
values, as they are safer for smaller environments such as laptops.
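For reference, a minimal sketch of the two knobs discussed above: splitting the input into fixed-size chunks of rows (the unit of work per worker) and deriving the default worker count. The helper names are illustrative, not the cqlsh option names.

```python
import itertools
import multiprocessing

def chunks(rows, chunk_size):
    """Yield successive lists of up to chunk_size rows; each list is
    the unit of work handed to one worker process."""
    it = iter(rows)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def default_workers():
    # Current default: num-cores minus 1; the benchmark above suggests
    # num-cores + 1 can be slightly faster on an over-sized machine.
    return max(multiprocessing.cpu_count() - 1, 1)
```

Larger chunks amortize per-chunk overhead (queue traffic, batch setup) but delay progress updates and raise the cost of a retry, which is why 1,000 remains a safer default than 5,000.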

I've also spent time trying to improve csv parsing times, comparing alternatives based
on [pandas|], [numpy|] and [numba|],
but none were worth pursuing further, at least not for this benchmark with its very simple type
conversions (text and integers). For more complex data types, such as dates or collections,
pure Cython conversion functions might help significantly.
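As a rough idea of the baseline the alternatives were measured against, here is a sketch of the stdlib-csv-plus-{{int()}} workload this benchmark exercises, timed with {{timeit}}. The function and data are illustrative, not the benchmark harness itself.

```python
import csv
import io
import timeit

def parse_stdlib(text):
    """Baseline parse: stdlib csv reader plus a per-field int()
    conversion -- the simple text/integer workload measured above."""
    return [(row[0], int(row[1]))
            for row in csv.reader(io.StringIO(text)) if row]

# Synthetic two-column data, roughly mimicking the benchmark's shape.
data = "\n".join("key%d,%d" % (i, i) for i in range(1000))
elapsed = timeit.timeit(lambda: parse_stdlib(data), number=10)
```

For plain text and integers this path is already dominated by cheap per-field work, which is why swapping in pandas/numpy readers bought little here.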

Whilst I still have a new set of profiler results to analyse, I feel we are reaching a point
of diminishing returns, where our efforts would be better spent elsewhere. As a comparison,
cassandra-stress with approx. 1KB partitions inserted 5M rows at a rate of 93k rows per second.
Since this is well within 10% of our results, I suggest we consider focusing on alternative
optimizations with wider use cases, such as supporting binary formats for COPY TO / FROM or
optimizing text conversion of complex data types.

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>                 Key: CASSANDRA-11053
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt,
parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed two issues:
> * The progress report is incorrect: it is very slow until almost the end of the test,
at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with a smaller
cluster locally (approx. 35,000 rows per second). As a comparison, cassandra-stress manages
50,000 rows per second under the same set-up, and is therefore 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.
