cassandra-commits mailing list archives

From "Stefania (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-9302) Optimize cqlsh COPY FROM, part 3
Date Mon, 02 Nov 2015 09:17:27 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984916#comment-14984916 ]

Stefania edited comment on CASSANDRA-9302 at 11/2/15 9:16 AM:
--------------------------------------------------------------

So far the most time-consuming thing to implement has been the text parsing needed to support
prepared statements, together with the associated tests covering composites and so forth. This
should be done now. The biggest gain, however, comes from batching. According to the Python
profiler, we spend most of the time sending requests to the server, and we cannot afford to do
this for each statement: since we want to take advantage of TAR and connection pools in the
driver, we must call {{execute_async()}}, which increases the cost per request. Even batches as
small as 10 statements have a huge impact, as they reduce this work by a factor of 10.
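
For illustration only (this is not the actual patch), a minimal sketch of the batching idea using
the DataStax Python driver; the keyspace, table, data and batch size below are made-up placeholders.
Grouping 10 rows into a single unlogged {{BatchStatement}} means one {{execute_async()}} call per
10 rows instead of one per row:

{code:python}
# Sketch only: one unlogged batch per execute_async() call instead of one
# call per row.  Keyspace, table and data below are hypothetical.
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('ks')
insert = session.prepare("INSERT INTO t (pk, ck, val) VALUES (?, ?, ?)")

rows = [(i % 5, i, 'value %d' % i) for i in range(100)]  # stand-in for parsed CSV rows
BATCH_SIZE = 10

futures = []
for start in range(0, len(rows), BATCH_SIZE):
    batch = BatchStatement(batch_type=BatchType.UNLOGGED)
    for row in rows[start:start + BATCH_SIZE]:
        batch.add(insert, row)
    futures.append(session.execute_async(batch))  # 1 request per 10 rows

for future in futures:
    future.result()  # wait for, and surface errors from, each batch
{code}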

I propose to batch as follows: pass each worker process a big chunk of approximately 1000
statements (configurable). Each worker process then checks whether it can group these entries by
partition key (PK). If a PK group has more than 10 entries (configurable), we send it as a batch
of its own; the remaining statements are aggregated into a single batch.
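
A rough sketch of that grouping logic, again just to illustrate the proposal rather than the patch
itself; {{get_pk}} is a hypothetical helper that extracts the partition key from a parsed row, and
the threshold mirrors the configurable default mentioned above:

{code:python}
# Sketch of the proposed grouping inside a worker process.  The worker
# receives a chunk of ~1000 parsed rows; get_pk is a hypothetical helper
# that returns the partition key of a row.
from collections import defaultdict

def split_into_batches(chunk, get_pk, min_pk_batch=10):
    """Return one batch per large PK group plus a single catch-all batch
    aggregating all the smaller groups."""
    by_pk = defaultdict(list)
    for row in chunk:
        by_pk[get_pk(row)].append(row)

    batches, leftovers = [], []
    for rows in by_pk.values():
        if len(rows) > min_pk_batch:
            batches.append(rows)    # large PK group sent as its own batch
        else:
            leftovers.extend(rows)  # small groups aggregated together
    if leftovers:
        batches.append(leftovers)
    return batches
{code}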

I've also added back-off and recovery, so CASSANDRA-9061 can be closed as a duplicate of
this ticket.


was (Author: stefania):
So far the most time-consuming thing to implement has been the text parsing needed to support
prepared statements, together with the associated tests covering composites and so forth. This
should be done now. The biggest gain, however, comes from batching. According to the Python
profiler, we spend most of the time creating messages to send to the server, and we cannot afford
to do this for each statement: since we want to take advantage of TAR and connection pools in the
driver, we must call {{execute_async()}}, which increases the cost per request compared to
creating a message passed directly to the connection (which is what we currently do). Even
batches as small as 10 statements have a huge impact, as they reduce this work by a factor of 10.

I propose to batch as follows: pass each worker process a big chunk of approximately 1000
statements (configurable). Each worker process then checks whether it can group these entries by
partition key (PK). If a PK group has more than 10 entries (configurable), we send it as a batch
of its own; the remaining statements are aggregated into a single batch.

I've also added back-off and recovery, so CASSANDRA-9061 can be closed as a duplicate of
this ticket.

> Optimize cqlsh COPY FROM, part 3
> --------------------------------
>
>                 Key: CASSANDRA-9302
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9302
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Critical
>             Fix For: 2.1.x
>
>
> We've had some discussion about moving to Spark CSV import for bulk load in 3.x, but people
> need a good bulk load tool now. One option is to add a separate Java bulk load tool (CASSANDRA-9048),
> but if we can match that performance from cqlsh I would prefer to leave COPY FROM as the preferred
> option to which we point people, rather than adding more tools that need to be supported indefinitely.
> Previous work on COPY FROM optimization was done in CASSANDRA-7405 and CASSANDRA-8225.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
