cassandra-commits mailing list archives

From "Stefania (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance
Date Wed, 03 Feb 2016 08:47:40 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130041#comment-15130041 ]

Stefania edited comment on CASSANDRA-11053 at 2/3/16 8:47 AM:
--------------------------------------------------------------

I've repeated the benchmark after performing the following:

* Moved csv decoding from the parent process to the worker processes (see the sketch below).
* Switched to a cythonized driver installation (for cassandra-2.1 we need to use driver version 2.7.2).


The patch is [here|https://github.com/stef1927/cassandra/tree/11053-2.1].
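
For the first bullet, the pattern is roughly the following; this is a simplified sketch with
made-up names ({{worker}}, {{feed_chunks}}, {{start_workers}}), not the actual {{copyutil.py}}
code:

{code}
# Simplified sketch: the parent only reads raw csv chunks and queues them;
# the workers do the csv decoding, so the parent stays free to receive
# results and update the progress report.
import csv
import multiprocessing

def worker(inqueue, outqueue):
    while True:
        chunk = inqueue.get()            # a list of raw csv lines
        if chunk is None:                # poison pill: no more work
            break
        rows = list(csv.reader(chunk))   # decoding now happens in the worker
        # ... convert the rows and send batches to cassandra here ...
        outqueue.put(len(rows))          # report progress to the parent

def feed_chunks(csv_file, inqueue, chunk_size=1000):
    chunk = []
    for line in csv_file:
        chunk.append(line)
        if len(chunk) == chunk_size:
            inqueue.put(chunk)
            chunk = []
    if chunk:
        inqueue.put(chunk)

def start_workers(num_workers):
    inqueue = multiprocessing.Queue()
    outqueue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(inqueue, outqueue))
             for _ in range(num_workers)]
    for p in procs:
        p.start()
    return inqueue, outqueue, procs
{code}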

The set-up and raw results are in the attached _copy_from_large_benchmark_2.txt_, along with the
new profiler results, _parent_profile_2.txt_ and _worker_profiles_2.txt_. The rate has increased
from *35,000* to *58,000* rows per second for the 1KB test:

{code}
cqlsh> COPY test.test1kb FROM 'DSEBulkLoadTest/in/data1KB/*.csv';
Using 7 child processes

Starting copy of test.test1kb with columns ['pkey', 'ccol', 'data'].
Processed: 20480000 rows; Rate:   63987 rows/s; Avg. rate:   58749 rows/s
20480000 rows imported from 20 files in 5 minutes and 48.605 seconds (0 skipped).
{code}


The progress reporting looks much better now, because the parent process no longer spends
time decoding csv data and therefore has more time to receive data and update the progress.

The parent process is fine now; even if we optimized it further it wouldn't matter much, since it
spends most of its time receiving data (289 out of 437 seconds).
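
The receive-and-report side of the parent boils down to a loop of this shape (again a simplified
sketch with made-up names, not the actual {{copyutil.py}} code), which is why the receive call
dominates its wall-clock time:

{code}
import time

def report_progress(outqueue, total_rows):
    # Drain per-chunk row counts sent by the workers and print a running rate;
    # this receive loop is where the parent spends most of its time.
    imported = 0
    start = time.time()
    while imported < total_rows:
        imported += outqueue.get()       # blocks on the result queue
        rate = imported / (time.time() - start)
        print('Processed: %d rows; Avg. rate: %7.0f rows/s' % (imported, rate))
{code}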

The worker processes can still be improved; here is where we currently spend most of the time
and what I am looking at improving:

{code}
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.606    3.606  432.297  432.297 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1743(run_normal)
   158538   86.237    0.001  245.629    0.002 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1800(send_normal_batch)
   161485   67.401    0.000  167.543    0.001 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1879(split_batches)
   158538   17.603    0.000   84.718    0.001 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1820(convert_rows)
   158538   46.302    0.000   73.577    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1874(execute_statement)
  2947000   37.470    0.000   61.277    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1918(get_replica)
  2947000   27.978    0.000   60.145    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1596(get_row_values)
  2947000    9.383    0.000   32.285    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1625(get_row_partition_key_values)
  8841000   21.195    0.000   31.220    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1600(convert)
  2947000   20.328    0.000   21.711    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1630(serialize)
  2947000    9.220    0.000   20.478    0.000 {filter}
     2948    0.040    0.000   15.513    0.005 /usr/lib/python2.7/multiprocessing/queues.py:113(get)
      2948   12.770    0.004   12.770    0.004 {method 'recv' of '_multiprocessing.Connection' objects}
  8841000   11.258    0.000   11.258    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1925(<lambda>)
   158558    7.525    0.000   10.034    0.000 /usr/local/lib/python2.7/dist-packages/cassandra_driver-2.7.2-py2.7-linux-x86_64.egg/cassandra/io/libevreactor.py:355(push)
{code}
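
For reference, per-worker tables like the one above can be produced with the standard
{{cProfile}} module, along these lines (simplified; {{worker_main}} stands in for the worker's
main loop):

{code}
import cProfile
import pstats

def profile_worker(worker_main):
    # Run the worker's main loop under the profiler and print the hottest
    # calls sorted by cumulative time, like the table above.
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        worker_main()
    finally:
        profiler.disable()
        pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)
{code}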

I note that 58,000 rows per second is an upper limit both locally, against a single cassandra
node running on the same laptop, and on an {{r3.2xlarge}} AWS instance running against a
cassandra cluster of 8 nodes on {{i2.2xlarge}} AWS instances. So the bottleneck is still the
importer even after these initial improvements. The same is true for cassandra-loader: it won't
move much beyond 50,000 rows per second on the 1KB benchmark (using default parameters and a
batch size of 8).

{code}
time ./cassandra-loader -f DSEBulkLoadTest/in/data1KB -host 172.31.16.79 \
  -schema "test.test1kb(pkey,ccol,data)" -batchSize 8
[...]
Lines Processed:        20480019  Rate:         50073.39608801956

real    6m50.633s
user    19m0.819s
sys     3m25.961s
{code}

Is this the right command [~brianmhess]?


> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt,
parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided in 20M records) revealed two issues:
> * The progress report is incorrect: it is very slow until almost the end of the test,
> at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with a smaller
> cluster locally (approx. 35,000 rows per second). As a comparison, cassandra-stress manages
> 50,000 rows per second under the same set-up, making it about 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



