cassandra-commits mailing list archives

From "Stefania (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9302) Optimize cqlsh COPY FROM, part 3
Date Tue, 15 Dec 2015 09:36:46 GMT


Stefania commented on CASSANDRA-9302:

bq. Now that we're not choosing session based on replica host, we might further simplify split_batches
to just group by partition key (i.e., no need for get_replica). Alternatively, if you want
to send to a specific host other than one that load balancing would choose, we would need
to borrow a connection and send directly on that (I don't think that's worth doing).

We need to batch by replica rather than just by partition key, because the scope is much wider.
Initially I batched only by primary key, but that gave very poor results for workloads
with unique primary keys, such as the _cassandra-stress_ workload we normally use to benchmark
these tools. If the current approach does not guarantee that we contact the same host, then we
must either borrow a connection to ensure that we do, or revert to individual sessions. Since
we have a cap of max_requests, reverting would also mean closing sessions as soon as we are
finished with them rather than at the very end.
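
To illustrate why the grouping scope matters, here is a minimal sketch, not the actual cqlsh
code. The host list, the CRC-based get_replica stand-in and the group_rows helper are all
assumptions made for the example; the real implementation would look up replicas from the
driver's token metadata.

```python
from collections import defaultdict
import zlib

# Hypothetical replica addresses, for illustration only.
HOSTS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]


def get_replica(partition_key):
    """Stand-in for a token-aware replica lookup (assumption):
    hash the key deterministically onto one of the hosts."""
    return HOSTS[zlib.crc32(partition_key.encode("utf-8")) % len(HOSTS)]


def group_rows(rows, key_fn):
    """Group rows into batches keyed by whatever key_fn returns."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return dict(groups)


# A workload where every row has a unique key, as in the stress benchmark.
rows = [("key%d" % i, "value%d" % i) for i in range(1000)]

by_partition = group_rows(rows, lambda row: row[0])
by_replica = group_rows(rows, lambda row: get_replica(row[0]))

print(len(by_partition))  # one single-row batch per unique key
print(len(by_replica))    # no more batches than there are hosts
```

With unique keys, grouping by key degenerates into one batch per row, whereas grouping by
replica keeps the number of batches bounded by the number of hosts.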

INGESTRATE throttles how quickly more work is sent, but it cannot be smaller than a single
workload unit (chunk_size * max_requests * num_processes). At a minimum I'll update the
documentation, or see if this can be simplified.
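
As a rough worked example of that lower bound (a sketch only: the numbers below are
hypothetical rather than the tool's defaults, and clamping with max() is my reading of
"cannot be smaller than", not necessarily how cqlsh behaves):

```python
def min_workload_unit(chunk_size, max_requests, num_processes):
    """Smallest amount of work that can be dispatched at once."""
    return chunk_size * max_requests * num_processes


def effective_ingest_rate(requested_rate, chunk_size, max_requests, num_processes):
    """Assumed behaviour: a requested rate below one workload unit
    is effectively raised to that unit."""
    unit = min_workload_unit(chunk_size, max_requests, num_processes)
    return max(requested_rate, unit)


# Hypothetical settings, chosen only to make the arithmetic easy to follow.
unit = min_workload_unit(chunk_size=1000, max_requests=6, num_processes=4)
print(unit)  # 24000

print(effective_ingest_rate(10000, 1000, 6, 4))   # 24000: request is below one unit
print(effective_ingest_rate(100000, 1000, 6, 4))  # 100000: request is honoured
```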

I'll fix the other two minor points as well, so moving back to in progress.

> Optimize cqlsh COPY FROM, part 3
> --------------------------------
>                 Key: CASSANDRA-9302
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Critical
>             Fix For: 2.1.x
>
> We've had some discussion moving to Spark CSV import for bulk load in 3.x, but people
> need a good bulk load tool now.  One option is to add a separate Java bulk load tool (CASSANDRA-9048),
> but if we can match that performance from cqlsh I would prefer to leave COPY FROM as the preferred
> option to which we point people, rather than adding more tools that need to be supported indefinitely.
> Previous work on COPY FROM optimization was done in CASSANDRA-7405 and CASSANDRA-8225.
