cassandra-commits mailing list archives

From "Adam Holmberg (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance
Date Fri, 26 Feb 2016 21:48:18 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169899#comment-15169899 ]

Adam Holmberg commented on CASSANDRA-11053:
-------------------------------------------

bq. The performance of COPY TO for a benchmark with only blobs drops from 150k rows/sec to
about 120k
I didn't expect it to be that punishing since there's no deserialization happening there.
That must just be the cost of the dispatch back to Python. Here's another option: I could
build in another deserializer for BytesType that returns a bytearray. You would then patch
in as follows:
{code}
>>> deserializers.DesBytesType = deserializers.DesBytesTypeByteArray

>>> s.execute('select c from test.t limit 1')[0]
    Row(c=bytearray(b'\xde\xad\xbe\xef'))
{code}
I can get it into the upcoming release if it would be useful for this integration.
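
For illustration only, the patch-in step above amounts to rebinding a module-level name so that later lookups pick up the bytearray variant. The sketch below uses a stand-in namespace rather than the driver's real deserializers module; the class bodies and the deserialize method are assumptions made for the example, not the driver's implementation.

```python
from types import SimpleNamespace


class DesBytesType:
    """Stand-in deserializer that copies the buffer into immutable bytes."""

    def deserialize(self, buf):
        return bytes(buf)


class DesBytesTypeByteArray:
    """Stand-in deserializer that returns a mutable bytearray instead."""

    def deserialize(self, buf):
        return bytearray(buf)


# Hypothetical stand-in for the driver's deserializers module.
deserializers = SimpleNamespace(
    DesBytesType=DesBytesType,
    DesBytesTypeByteArray=DesBytesTypeByteArray,
)

raw = memoryview(b"\xde\xad\xbe\xef")

# Default behaviour: the lookup resolves to the bytes-returning class.
before = deserializers.DesBytesType().deserialize(raw)

# Patch in the bytearray variant, mirroring the session above.
deserializers.DesBytesType = deserializers.DesBytesTypeByteArray

# The same lookup now yields a bytearray.
after = deserializers.DesBytesType().deserialize(raw)
```

Because the swap is a plain attribute rebind, it only affects code that looks the name up after the patch is applied.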

bq. I'm unsure what to do: parsing the CQL type is safer but ...
I was also on the fence due to the new complexity. I think I favor the CQL type interpretation
despite the complexity for one reason: it decouples formatting from driver return values.
Return types don't change often, but when they have needed specialization to support evolving
features (set-->SortedSet, dict-->OrderedMap), that change would ripple into cqlsh. Basing
formatting on the CQL type avoids that.
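
A minimal sketch of that decoupling, assuming a hypothetical formatter registry keyed by CQL type name (none of these names come from cqlsh). frozenset and OrderedDict stand in for specialized driver return types such as SortedSet and OrderedMap:

```python
from collections import OrderedDict


def format_set(value):
    # Sort so the output does not depend on the container's iteration order.
    return "{" + ", ".join(repr(v) for v in sorted(value)) + "}"


def format_map(value):
    # Accept any mapping-like container by normalizing through dict().
    items = sorted(dict(value).items())
    return "{" + ", ".join(f"{k!r}: {v!r}" for k, v in items) + "}"


# Hypothetical registry keyed by CQL type name, not by Python class.
FORMATTERS = {
    "set": format_set,
    "map": format_map,
}


def base_cql_type(cql_type):
    # 'set<int>' -> 'set'; a real parser would also handle nesting and frozen<>.
    return cql_type.split("<", 1)[0].strip()


def format_value(cql_type, value):
    # Fall back to repr() for types without a dedicated formatter.
    formatter = FORMATTERS.get(base_cql_type(cql_type), repr)
    return formatter(value)


plain = format_value("set<int>", {3, 1, 2})
special = format_value("set<int>", frozenset([3, 1, 2]))
mapping = format_value("map<text, int>", OrderedDict([("b", 2), ("a", 1)]))
```

Since the lookup never inspects the Python class, swapping the container type the driver returns leaves the formatted output unchanged.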

bq. The progress report was fixed by two things...
Thanks. I figured out what my problem was. I was missing most of the diff because I overlooked
this notice on GitHub: "761 additions, 409 deletions not shown because the diff is too large." I have
more to look at.

bq. I'm undecided on two more things...default INGESTRATE...default worker processes
I generally err on the side of caution. Reasonable limits would prevent someone from inadvertently
crushing a server with a basic command. The command options make it easy enough to dial up
for big load operations.
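
To make the idea of a cautious default concrete, here is a rough sketch of a fixed-window throttle that caps rows per second. It is not how cqlsh implements INGESTRATE; the RateLimiter class and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time


class RateLimiter:
    """Allow at most `rate` rows per one-second window (hypothetical helper)."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.clock = clock
        self.sleep = sleep
        self.window_start = clock()
        self.sent = 0

    def acquire(self, n=1):
        now = self.clock()
        if now - self.window_start >= 1.0:
            # A full second has passed: start a fresh window.
            self.window_start = now
            self.sent = 0
        if self.sent + n > self.rate:
            # Budget exhausted: wait out the rest of the window, then reset.
            self.sleep(max(0.0, 1.0 - (now - self.window_start)))
            self.window_start = self.clock()
            self.sent = 0
        self.sent += n


# Usage with a fake clock, so nothing actually sleeps.
fake_time = [0.0]
slept = []


def fake_clock():
    return fake_time[0]


def fake_sleep(seconds):
    slept.append(seconds)
    fake_time[0] += seconds


limiter = RateLimiter(2, clock=fake_clock, sleep=fake_sleep)
limiter.acquire()
limiter.acquire()
sleeps_within_budget = list(slept)
limiter.acquire()  # third row in the same window forces a wait
```

With a conservative default for rate, a basic command stays gentle on the server, and a caller doing a big load passes a higher value explicitly.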

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt,
parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided in 20M records) revealed two issues:
> * The progress report is incorrect: it is very slow until almost the end of the test,
at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with a smaller
cluster locally (approx 35,000 rows per second). As a comparison, cassandra-stress manages
50,000 rows per second under the same set-up, making it roughly 1.4 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
