cassandra-commits mailing list archives

From "Adam Holmberg (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance
Date Thu, 25 Feb 2016 20:10:18 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167777#comment-15167777 ]

Adam Holmberg edited comment on CASSANDRA-11053 at 2/25/16 8:10 PM:
--------------------------------------------------------------------

bq. At least for the time being I decided to look directly into the CQL type name...but I
am not so sure how it would be possible with the cython extensions.
Thanks for the explanation. I also think that makes cqlsh more robust. However, if you did
want to avoid the extra complexity, there is a way to bypass Cython deserialization when that
protocol handler is in use:
{code}
del cassandra.deserializers.DesBytesType
{code}
This causes the parser to default back to the patched cqltypes.BytesType.
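
The fallback can be pictured with a toy model. This is not the driver's code: it assumes, purely for illustration, that fast deserializers are found by name ("Des" plus the type name) and that a missing entry means the type class's own {{deserialize}} is used. The names {{fast}}, {{des_bytes_fast}} and {{find_deserializer}} are hypothetical stand-ins.

```python
import types

# Hypothetical stand-in for a module holding fast, type-specific deserializers.
fast = types.SimpleNamespace()


class BytesType:
    """Stand-in for a pure-Python type class whose deserialize was patched."""

    @staticmethod
    def deserialize(raw):
        # Pretend this is the patched behaviour we want to reach.
        return "patched:" + raw.hex()


def des_bytes_fast(raw):
    """Stand-in for the fast deserializer that shadows the patched one."""
    return bytes(raw)


fast.DesBytesType = des_bytes_fast


def find_deserializer(cqltype):
    # Assumed lookup rule: prefer "Des<TypeName>" when it exists,
    # otherwise fall back to the type class's own deserialize.
    return getattr(fast, "Des" + cqltype.__name__, cqltype.deserialize)


before = find_deserializer(BytesType)(b"\x01")  # fast path is used
del fast.DesBytesType                           # remove the fast entry
after = find_deserializer(BytesType)(b"\x01")   # patched class is used
```

Under that assumption, deleting the entry is enough to route every later lookup through the patched class, with no change at the call sites.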

A few other thoughts...

*cqlshlib.formatting.get_sub_types:*
{code}
+    else:
+        if last < len(val) - 1:
+            ret.append(val[last:].strip())
{code}
This block will always run since there is no break from the loop. Consider moving it out of
the {{else}} to make this clearer?
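
For reference, a small sketch of the {{for}}/{{else}} rule behind this remark: the {{else}} suite is skipped only when the loop exits through {{break}}. The function {{scan}} and its {{stop_at}} parameter are made up for the illustration and are not part of the patch.

```python
def scan(items, stop_at=None):
    """Walk items; report what was visited and whether the else suite ran."""
    visited = []
    else_ran = False
    for item in items:
        if item == stop_at:
            break  # the only way to skip the else suite
        visited.append(item)
    else:
        # Runs whenever the loop finishes without break,
        # including when items is empty.
        else_ran = True
    return visited, else_ran


no_break = scan("abc")                  # loop runs to completion
with_break = scan("abc", stop_at="b")   # loop exits early via break
empty = scan("")                        # loop body never executes
```

A loop with no {{break}} therefore behaves exactly as if the {{else}} body were written at the loop's indentation level right after it.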

*bin/cqlsh.Shell.print_static_result*
{code}
+        if table_meta:
+            cqltypes = [table_meta.columns[c].typestring if c in table_meta.columns else None for c in colnames]
{code}
There is an API change in driver 3.0 (C* cqlsh 2.2+) that will impact this.
This brings us to the question of targeting 2.1. cqlsh in 2.1 was diverging from 2.2+, and
is even more so after CASSANDRA-10513 (2.1 did not receive the driver 3.0 upgrade). I'm interested
to hear the input on whether this should go to 2.1.

*"fix progress report"*
It's part of the summary, but I don't see anything in the [changeset|https://github.com/apache/cassandra/compare/cassandra-2.1...stef1927:11053-2.1]
related to progress reporting. I ran an identical load with 2.1.13 and noticed that progress
samples are much less frequent on this branch (by a factor of 3). Both progressions were
roughly linear.
I don't suspect this change, but just thought I'd mention in case something unintentional
happened between 2.1.13 and here.

*side note*
Unrelated to this change, but I stumbled upon an SO question at the same time as I was reviewing
this ticket:
http://stackoverflow.com/q/35632114/20688
I'm now wondering: should we be using repr, or forcing high precision when doing copies to
avoid loss of precision (or providing a precision option for COPY FROM)?
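
A minimal sketch of the concern, assuming the loss comes from formatting floats with a fixed number of significant digits (12 is used here only as an example) rather than with {{repr}}:

```python
value = 0.1 + 0.2

# Fixed precision: 12 significant digits, trailing zeros dropped.
short = "%.12g" % value

# repr gives the shortest string that parses back to the same float
# (Python 2.7 and Python 3).
full = repr(value)

short_round_trips = float(short) == value
full_round_trips = float(full) == value
```

Here {{short}} parses back to a different float than the one that was exported, while {{full}} round-trips exactly, at the cost of longer text in the output file.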



> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt,
>                      parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed two issues:
> * The progress report is incorrect: it advances very slowly until almost the end of the test,
> at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with a smaller
> cluster locally (approx 35,000 rows per second). As a comparison, cassandra-stress manages
> 50,000 rows per second under the same set-up, which makes it roughly 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
