cassandra-commits mailing list archives

From "Aleksey Yeschenko (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
Date Mon, 03 Nov 2014 16:06:36 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194662#comment-14194662 ]

Aleksey Yeschenko commented on CASSANDRA-8225:
----------------------------------------------

It can even have a csv-file input format, for all I care, afterwards.

But what's proposed here is close to pointless. If there is a lot of data to bulk load, then
you want it distributed, anyway. If there isn't, then 10x faster COPY FROM is still good enough.

Again, I 100% agree that we need to improve our bulk loading game. Yet I'm certain that what
we really need is not "Production-capable COPY FROM" but
"Production-capable something-to-bulk-load-that's-not-necessarily-csvloader",
and the current issue title/description mentions COPY FROM for the single reason that it's
the only simple thing we've ever had.

> Production-capable COPY FROM
> ----------------------------
>
>                 Key: CASSANDRA-8225
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>            Reporter: Jonathan Ellis
>             Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a SourceForge data generator and created a mock file of 500,000 rows
> that had an incrementing sequence number, date, and SSN. I then used our COPY command and
> MySQL's LOAD DATA INFILE to load the file on my Mac. Results were:
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt' into table p_test fields terminated by ',';
> Query OK, 500000 rows affected (2.18 sec)
> {noformat}
> Cassandra 2.1.0 (pre-CASSANDRA-7405):
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with delimiter=',';
> 500000 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with delimiter=',';
> Processed 500000 rows; Write: 4037.46 rows/s
> 500000 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  Unfortunately we're
> still almost 2 orders of magnitude slower than MySQL.
> I don't think we can continue to tell people, "use sstableloader instead."  The number
> of users sophisticated enough to use the sstable writers is small and (relatively) decreasing
> as our user base expands.
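
For reference, the "almost an order of magnitude" and "almost 2 orders" figures in the quoted description can be checked from the three timings it reports. The following is only a back-of-the-envelope sketch in Python; the variable and function names are illustrative and are not part of cqlsh or of the issue itself.

```python
# Timings copied from the quoted issue description, converted to seconds.
ROWS = 500_000
mysql_s = 2.18                        # MySQL LOAD DATA INFILE
cassandra_210_s = 16 * 60 + 45.485    # COPY FROM on 2.1.0 (pre-CASSANDRA-7405)
cassandra_211_s = 2 * 60 + 3.058      # COPY FROM on 2.1.1


def rows_per_second(rows, seconds):
    """Average throughput over the whole load."""
    return rows / seconds


def ratio(slower_s, faster_s):
    """How many times longer the slower run took."""
    return slower_s / faster_s


speedup_from_7405 = ratio(cassandra_210_s, cassandra_211_s)
gap_to_mysql = ratio(cassandra_211_s, mysql_s)

print(f"2.1.0 -> 2.1.1 speedup: {speedup_from_7405:.1f}x")
print(f"2.1.1 vs MySQL:         {gap_to_mysql:.1f}x slower")
print(f"2.1.1 average rate:     {rows_per_second(ROWS, cassandra_211_s):.0f} rows/s")
print(f"MySQL average rate:     {rows_per_second(ROWS, mysql_s):.0f} rows/s")
```

With these inputs the speedup from 7405 comes out at a little over 8x and the remaining gap to MySQL at roughly 56x, which matches the wording in the description. The average 2.1.1 rate computed this way is close to, but not identical with, the 4037.46 rows/s that cqlsh printed.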



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
