Date: Mon, 2 Nov 2015 09:17:27 +0000 (UTC)
From: "Stefania (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-9302) Optimize cqlsh COPY FROM, part 3

[ https://issues.apache.org/jira/browse/CASSANDRA-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984916#comment-14984916 ]

Stefania edited comment on CASSANDRA-9302 at 11/2/15 9:16 AM:
--------------------------------------------------------------

So far the most time-consuming thing to implement has been text parsing, needed to support prepared statements, and the associated tests with composites and so forth. This should be done now. The biggest gain, however, comes from batching.
According to the Python profiler, we spend most of the time sending requests to the server; we cannot afford to do this for each statement. Moreover, if we want to take advantage of TAR and connection pools in the driver, we must call {{execute_async()}}, which increases the cost per request. Even batches as small as 10 statements have a huge impact, as they reduce the work by a factor of 10. I propose to batch as follows: pass to each worker process a big batch of approximately 1000 statements (configurable). Each worker process then checks whether it can group these entries by PK. If a PK group has more than 10 entries (configurable), we send it as a batch; otherwise we aggregate the remaining statements into a single batch. I've also added back-off and recovery, therefore CASSANDRA-9061 can be closed as a duplicate of this ticket.

was (Author: stefania):
So far the most time-consuming thing to implement has been text parsing, needed to support prepared statements, and the associated tests with composites and so forth. This should be done now. The biggest gain, however, comes from batching.

According to the Python profiler, we spend most of the time creating messages to send to the server; we cannot afford to do this for each statement. Moreover, if we want to take advantage of TAR and connection pools in the driver, we must call {{execute_async()}}, which increases the cost per request compared to creating a message passed directly to the connection (which is what we currently do). Even batches as small as 10 statements have a huge impact, as they reduce the work by a factor of 10. I propose to batch as follows: pass to each worker process a big batch of approximately 1000 statements (configurable). Each worker process then checks whether it can group these entries by PK. If a PK group has more than 10 entries (configurable), we send it as a batch; otherwise we aggregate the remaining statements into a single batch.
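The grouping step described above can be sketched roughly as follows. This is only an illustration of the strategy, not the actual cqlsh code: the function name {{split_batches}}, the {{get_pk}} extractor, and the constant names are hypothetical, and the two thresholds mirror the configurable defaults mentioned in the comment.

{code}
from collections import defaultdict

CHUNK_SIZE = 1000    # rows handed to each worker process (configurable)
MIN_PK_BATCH = 10    # PK groups larger than this get their own batch (configurable)

def split_batches(rows, get_pk):
    """Group rows by partition key: each sufficiently large PK group
    becomes its own batch, and all remaining rows are aggregated into
    a single leftover batch."""
    groups = defaultdict(list)
    for row in rows:
        groups[get_pk(row)].append(row)

    batches, leftovers = [], []
    for pk, group in groups.items():
        if len(group) > MIN_PK_BATCH:
            batches.append(group)   # dedicated batch for a large PK group
        else:
            leftovers.extend(group)
    if leftovers:
        batches.append(leftovers)   # everything else in one batch
    return batches
{code}

Each returned batch would then be sent with a single {{execute_async()}} call, so a chunk of 1000 rows costs on the order of tens of requests rather than 1000.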
I've also added back-off and recovery, therefore CASSANDRA-9061 can be closed as a duplicate of this ticket.

> Optimize cqlsh COPY FROM, part 3
> --------------------------------
>
>                 Key: CASSANDRA-9302
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9302
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Critical
>             Fix For: 2.1.x
>
>
> We've had some discussion moving to Spark CSV import for bulk load in 3.x, but people need a good bulk load tool now. One option is to add a separate Java bulk load tool (CASSANDRA-9048), but if we can match that performance from cqlsh I would prefer to leave COPY FROM as the preferred option to which we point people, rather than adding more tools that need to be supported indefinitely.
> Previous work on COPY FROM optimization was done in CASSANDRA-7405 and CASSANDRA-8225.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)