Date: Thu, 30 Oct 2014 22:51:35 +0000 (UTC)
From: "Aleksey Yeschenko (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

    [ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190978#comment-14190978 ]

Aleksey Yeschenko commented on CASSANDRA-8225:
----------------------------------------------

Is CSV even the primary source of imported data in C*? On common CSV-exported datasets, does 1 second vs 1 minute even matter?

Continuing to use COPY FROM + a Java tool behind the scenes has its issues. COPY TO/FROM are being kept in sync wrt accepted formatting, and generally any cqlsh changes are reflected there. Now we'd have to emulate that logic in Java, too. And copy it first.

We also do still have CASSANDRA-7793 and CASSANDRA-7794 open. They won't give us anything close to a 200x improvement, but 1) they might get us to something acceptable and 2) they are relatively low-hanging fruit (LHF). Besides, since CASSANDRA-5894 it's not that complicated to write an sstablewriter.

So this would be nice to have, but I'm not sure the ROI is there.

> Production-capable COPY FROM
> ----------------------------
>
>                 Key: CASSANDRA-8225
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>            Reporter: Jonathan Ellis
>             Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a mock file of 500,000 rows that had an incrementing sequence number, date, and SSN. I then used our COPY command and MySQL's LOAD DATA INFILE to load the file on my Mac. Results were:
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt' into table p_test fields terminated by ',';
> Query OK, 500000 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with delimiter=',';
> 500000 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with delimiter=',';
> Processed 500000 rows; Write: 4037.46 rows/s
> 500000 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement. Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead." The number of users sophisticated enough to use the sstable writers is small and (relatively) decreasing as our user base expands.
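
For reference, the throughput and slowdown figures implied by the quoted benchmark can be recomputed from the three reported load times. This is only a back-of-the-envelope sketch: the row count and timings are copied from the issue description, while the helper names (rows_per_second, slowdown_vs_mysql) are made up for illustration and are not part of cqlsh or any Cassandra tool.

```python
# Recompute the ratios implied by the benchmark quoted in the issue description.
# Timings are taken verbatim from the quoted output; helper names are hypothetical.

ROWS = 500_000

MYSQL = "MySQL LOAD DATA INFILE"
CQLSH_210 = "cqlsh COPY FROM, C* 2.1.0"
CQLSH_211 = "cqlsh COPY FROM, Cassandra 2.1.1"

timings_s = {
    MYSQL: 2.18,                    # "Query OK, 500000 rows affected (2.18 sec)"
    CQLSH_210: 16 * 60 + 45.485,    # "16 minutes and 45.485 seconds"
    CQLSH_211: 2 * 60 + 3.058,      # "2 minutes and 3.058 seconds"
}


def rows_per_second(seconds):
    """Average rows loaded per second over the whole run."""
    return ROWS / seconds


def slowdown_vs_mysql(seconds):
    """How many times longer the run took than the MySQL load."""
    return seconds / timings_s[MYSQL]


if __name__ == "__main__":
    for name, seconds in timings_s.items():
        print(f"{name}: {rows_per_second(seconds):,.0f} rows/s, "
              f"{slowdown_vs_mysql(seconds):.1f}x the MySQL load time")
```

By this arithmetic, 2.1.1 is roughly 8x faster than 2.1.0 and takes roughly 56x as long as the MySQL load, which is consistent with "almost an order of magnitude improvement" and "almost 2 orders slower" in the description.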