cassandra-commits mailing list archives

From "Jeremiah Jordan (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance
Date Thu, 17 Mar 2016 19:22:33 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremiah Jordan reopened CASSANDRA-11053:
-----------------------------------------

On the following node:
{noformat}
Linux atest-55c62b1 3.13.0-74-generic #118-Ubuntu SMP Thu Dec 17 22:52:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
{noformat}

Running cassandra-3.0 HEAD, this COPY change is broken for a simple test.

{code}
$ cat kv.cql
create keyspace if not exists cvs_copy_ks with replication = {'class': 'SimpleStrategy', 'replication_factor':1};
create table if not exists cvs_copy_ks.kv (key int primary key, value text);
truncate cvs_copy_ks.kv;
copy cvs_copy_ks.kv (key, value) from 'kv.csv' with header='true';
select * from cvs_copy_ks.kv;
{code}

{code}
$ cat kv.csv
key,value
1,'a'
2,'b'
3,'c'
{code}

If I run that, it just hangs.
{code}
./cqlsh -f kv.cql
Using 1 child processes

Starting copy of cvs_copy_ks.kv with columns ['key', 'value'].
{code}
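
When reproducing a hang like this from a script, it can help to bound the run so the test does not block indefinitely. The following is a minimal sketch, not part of the original report: `run_with_timeout` is a hypothetical helper, and the 60-second limit in the comment is an arbitrary assumption.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising; any worker
        # processes that child spawned may still need separate cleanup.
        return None


# Stand-in command so the sketch runs anywhere. For the reproduction above,
# the call would presumably be:
#     run_with_timeout(["./cqlsh", "-f", "kv.cql"], 60)
print(run_with_timeout([sys.executable, "-c", "pass"], 10))  # 0
```

A return value of `None` would then indicate that the command was still running when the limit expired, as in the hang shown above.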

I added some debug output, and it hangs, spinning here:

https://github.com/apache/cassandra/blob/cassandra-3.0/pylib/cqlshlib/copyutil.py#L1166

because channels is an empty list.
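
To illustrate why an empty list leads to a spin rather than an error, here is a minimal sketch of a polling loop. It is not the code in copyutil.py: `poll_channels`, `has_message`, and the `max_spins` cap are hypothetical, and the cap exists only so that the example terminates.

```python
import time


def poll_channels(channels, has_message, max_spins=5, pause_s=0.0):
    """Poll each channel until one has a message, giving up after max_spins.

    Hypothetical stand-in for a receive loop. Without the max_spins cap,
    an empty channel list would make this loop spin forever.
    """
    spins = 0
    while spins < max_spins:
        for channel in channels:  # never entered when channels == []
            if has_message(channel):
                return channel, spins
        spins += 1
        time.sleep(pause_s)
    return None, spins


# With no channels there is nothing to poll, so every pass is wasted.
print(poll_channels([], lambda ch: True))            # (None, 5)
# With one ready channel the first pass returns immediately.
print(poll_channels(["worker-0"], lambda ch: True))  # ('worker-0', 0)
```

In a loop of this shape, checking for an empty list before polling would turn the silent hang into a visible failure; whether that is the right fix for copyutil.py is a separate question.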

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>              Labels: doc-impacting
>             Fix For: 2.1.14, 2.2.6, 3.0.5, 3.5
>
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
>
>
> h5. Description
> Running COPY FROM on a large dataset (20G divided in 20M records) revealed two issues:
> * The progress report is incorrect: it is very slow until almost the end of the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with a smaller cluster locally (approx. 35,000 rows per second). As a comparison, cassandra-stress manages 50,000 rows per second under the same set-up, making it about 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.
> h5. Doc-impacting changes to COPY FROM options
> * A new option was added: PREPAREDSTATEMENTS - it indicates if prepared statements should be used; it defaults to true.
> * The default value of CHUNKSIZE changed from 1000 to 5000.
> * The default value of MINBATCHSIZE changed from 2 to 10.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
