cassandra-commits mailing list archives

From "Stefania (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance
Date Fri, 26 Feb 2016 04:19:18 GMT


Stefania commented on CASSANDRA-11053:

bq. {{del cassandra.deserializers.DesBytesType}} causes the parser to default back to the
patched cqltypes.BytesType

That's interesting. It definitely works. The performance of COPY TO for a benchmark with only
blobs drops from 150k rows/sec to about 120k rows/sec locally, but the opposite would probably
be true for a benchmark with CQL composite types. It would be very nice to remove the formatting
changes from this patch, especially if it needs to go to 2.1. I've got a [separate branch|]
without the formatting changes. I'm unsure what to do: parsing the CQL type is safer, but it
is also bolted onto existing, simpler logic that relies only on Python types, and it makes
this patch more complex than it needs to be. WDYT?
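
The fallback described in the quoted remark can be pictured with a toy registry. This is only a sketch, under the assumption that deserializers are looked up by a {{Des}}-prefixed class name with a generic fallback; every class and function below is a hypothetical stand-in, not the driver's code.

```python
class BytesType(object):
    """Hypothetical stand-in for the patched pure-Python type."""

    @staticmethod
    def deserialize(raw):
        # The patched behaviour we want the parser to reach.
        return bytearray(raw)


class DesBytesType(object):
    """Hypothetical specialised deserializer that bypasses the patched type."""

    def __init__(self, cqltype):
        self.cqltype = cqltype

    def deserialize(self, raw):
        return bytes(raw)


class GenericDeserializer(object):
    """Hypothetical fallback: delegates to the type's own deserialize()."""

    def __init__(self, cqltype):
        self.cqltype = cqltype

    def deserialize(self, raw):
        return self.cqltype.deserialize(raw)


def find_deserializer(cqltype, registry):
    # Assumed lookup rule: 'Des' + type name, else the generic fallback.
    cls = registry.get('Des' + cqltype.__name__, GenericDeserializer)
    return cls(cqltype)


registry = {'DesBytesType': DesBytesType}
before = find_deserializer(BytesType, registry)
del registry['DesBytesType']          # analogous to the `del` in the quote
after = find_deserializer(BytesType, registry)
print(type(before).__name__, type(after).__name__)
```

Once the specialised entry is removed, the lookup lands on the generic path, which delegates to the patched type. The extra indirection is one plausible reason for the slowdown seen with blobs.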


+    else:
+        if last < len(val) - 1:
+            ret.append(val[last:].strip())

Fixed, thank you.


+        if table_meta:
+            cqltypes = [table_meta.columns[c].typestring if c in table_meta.columns else None for c in colnames]
bq. There is an API change in driver 3.0 (C* cqlsh 2.2+) that will impact this.

I'm aware of this, I believe all that's needed is to replace {{typestring}} with {{cql_type}}.
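
A minimal sketch of a version-tolerant lookup, assuming (as stated above) that driver 3.0 exposes the type as {{cql_type}} where older drivers exposed {{typestring}}. The helper names and the stand-in column classes are hypothetical and only serve the illustration.

```python
def column_type_name(column):
    """Return the CQL type string of a column metadata object.

    Prefers `cql_type` (assumed driver 3.0 attribute) and falls back to
    `typestring` (assumed pre-3.0 attribute); None if neither is present.
    """
    for attr in ('cql_type', 'typestring'):
        value = getattr(column, attr, None)
        if value is not None:
            return value
    return None


def cql_types_for(columns, colnames):
    """Same shape as the comprehension in the patch: None for unknown columns."""
    return [column_type_name(columns[c]) if c in columns else None
            for c in colnames]


class OldColumn(object):   # hypothetical stand-in for pre-3.0 column metadata
    typestring = 'blob'


class NewColumn(object):   # hypothetical stand-in for 3.0 column metadata
    cql_type = 'list<int>'


columns = {'a': OldColumn(), 'b': NewColumn()}
print(cql_types_for(columns, ['a', 'b', 'missing']))
```

A helper like this would let the same code run on both the 2.1 and 2.2+ branches, at the cost of one extra attribute lookup per column.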

bq. This brings us to the question of targeting 2.1. cqlsh in 2.1 was diverging from 2.2+,
and is even more so after CASSANDRA-10513 (2.1 did not receive the driver 3.0 upgrade). I'm
interested to hear the input on whether this should go to 2.1.

I've asked offline regarding the target version, hopefully we'll know soon.

*"fix progress report"*
bq. It's part of the summary, but I don't see anything in the changeset related to progress reporting.
I ran an identical load with 2.1.13 and noticed that progress samples
are much less frequent on this branch.

The progress report was fixed by two things:

* the worker processes only feed aggregated results when the entire chunk is completed rather
than for every batch; this dramatically decreased the number of results to be collected and
also explains the change in frequency of the progress report. You will have noticed that the
progress now increments by a multiple of the chunk size, rather than of the batch size. The
report frequency is still 4 times per second, but if no chunks were completed during an
interval then the reported progress will not change; this is expected.

* the introduction of the feeder process; the only job of the parent process is now to collect
results. Previously it was both sending data and collecting results; depending on the ingest
rate and the polling sleep time, it could fall behind schedule.
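
The two changes above can be sketched roughly as follows. This is an illustration only, not the patch itself: threads and in-memory queues stand in for the real processes and their channels, and the sizes and function names are made up.

```python
import queue
import threading

BATCH_SIZE = 20    # rows per batch (illustrative value)
CHUNK_SIZE = 100   # rows per chunk handed to a worker (illustrative value)
STOP = None        # sentinel marking the end of the work


def feeder(rows, work_q, num_workers):
    """Split the input into chunks and hand them to the workers."""
    for start in range(0, len(rows), CHUNK_SIZE):
        work_q.put(rows[start:start + CHUNK_SIZE])
    for _ in range(num_workers):
        work_q.put(STOP)


def worker(work_q, result_q):
    """Process a chunk batch by batch, but report only once per chunk."""
    while True:
        chunk = work_q.get()
        if chunk is STOP:
            result_q.put(STOP)
            return
        imported = 0
        for start in range(0, len(chunk), BATCH_SIZE):
            batch = chunk[start:start + BATCH_SIZE]
            imported += len(batch)   # stand-in for sending the batch
        result_q.put(imported)       # one aggregated result per chunk


def run_import(rows, num_workers=4):
    """Parent: starts feeder and workers, then only collects results."""
    work_q, result_q = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=feeder,
                                args=(rows, work_q, num_workers))]
    threads += [threading.Thread(target=worker, args=(work_q, result_q))
                for _ in range(num_workers)]
    for t in threads:
        t.start()
    total, reports, finished = 0, 0, 0
    while finished < num_workers:
        result = result_q.get()
        if result is STOP:
            finished += 1
        else:
            total += result
            reports += 1
    for t in threads:
        t.join()
    return total, reports


total, reports = run_import(list(range(1050)))
print(total, reports)
```

With these illustrative sizes the parent receives 11 results (one per chunk) instead of 53 (one per batch), which is why the reported progress moves in chunk-sized steps.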

*side note*
bq. Should we be using repr, or forcing high precision when doing copies, to avoid loss of precision
(or providing a precision option for COPY FROM)?

The problem isn't COPY FROM, it's COPY TO exporting with the precision of cqlsh, which by
default is too low. I've created CASSANDRA-11255 to add a new COPY TO option, since this is
not related to performance and it's definitely a new feature.
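
To illustrate the loss on export, here is a small sketch. It assumes a display precision of 5 significant digits, which I believe is the cqlsh default but which should be treated as an assumption here; the helper name is hypothetical.

```python
DISPLAY_PRECISION = 5   # assumed default display precision


def format_low_precision(value, precision=DISPLAY_PRECISION):
    """Format a float with a fixed number of significant digits."""
    return '%.*g' % (precision, value)


original = 3.141592653589793
exported = format_low_precision(original)

print(exported)                           # truncated text written by the export
print(float(exported) == original)        # re-importing does not round-trip
print(float(repr(original)) == original)  # repr() keeps enough digits
```

Whatever COPY FROM does afterwards, the digits are already gone from the exported file, so the fix belongs on the COPY TO side.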


I'm undecided on two more things:

* the default INGESTRATE: 200k may be a little bit too high and I'm thinking of changing it
back to 100k or maybe 120k-150k.
* the default number of worker processes is no longer capped; I think it is safer to reintroduce
the cap of 16, which people can override via NUMPROCESSES.
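
A possible shape for the capped default, as a sketch only. The cap of 16 and the NUMPROCESSES override come from the point above; leaving one core free for the parent is an assumption of this example, and the function name is hypothetical.

```python
def default_num_processes(cpu_count, cap=16, requested=None):
    """Pick the number of worker processes.

    `requested` models an explicit NUMPROCESSES option and bypasses the cap.
    Otherwise use one process per core minus one (assumption: one core is
    left for the parent), bounded to the range [1, cap].
    """
    if requested is not None:
        return max(1, requested)
    return max(1, min(cap, cpu_count - 1))


print(default_num_processes(4))                  # small machine
print(default_num_processes(64))                 # large machine, capped
print(default_num_processes(64, requested=32))   # explicit override
```

The cap protects large machines from spawning dozens of workers by default, while an explicit option still allows it.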

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>                 Key: CASSANDRA-11053
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
> Running COPY FROM on a large dataset (20G divided in 20M records) revealed two issues:
> * The progress report is incorrect: it is very slow until almost the end of the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with a smaller cluster locally (approx 35,000 rows per second). As a comparison, cassandra-stress manages 50,000 rows per second under the same set-up, making it roughly 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.

