cassandra-commits mailing list archives

From "David Kua (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements
Date Mon, 20 Jul 2015 19:42:04 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633984#comment-14633984 ]

David Kua edited comment on CASSANDRA-9304 at 7/20/15 7:41 PM:
---------------------------------------------------------------

https://github.com/dkua/cassandra/tree/9304

The branch above contains my improvements to COPY TO. The approach is to work out the token
ranges from the token ring, start some subprocesses, and give each subprocess a subset of the
ranges. The subprocesses perform their queries asynchronously and pass each formatted page
back to the parent process, which writes it to the CSV file.
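
As a rough illustration of that flow (not the code in the branch), here is a minimal Python sketch.
Everything in it is an assumption made for illustration: `FAKE_TABLE`, `split_ring`, `fetch_pages`,
`format_page`, `worker`, and `export` are hypothetical names, `fetch_pages` stands in for a paged
CQL query restricted to one token range, and the ring handling is simplified (Murmur3 token bounds,
no wrap-around range, no replica awareness).

```python
import csv
import io
import multiprocessing as mp
import sys

# Token bounds of the Murmur3 partitioner (assumed here for illustration).
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1

# Hypothetical in-memory "table": each row is identified by its token.
FAKE_TABLE = [-9000, -42, 7, 1234, 5000, 99999]


def split_ring(ring_tokens):
    """Turn the ring's tokens into (start, end] ranges covering the ring."""
    bounds = [MIN_TOKEN] + sorted(ring_tokens) + [MAX_TOKEN]
    return [(lo, hi) for lo, hi in zip(bounds, bounds[1:]) if lo < hi]


def fetch_pages(token_range, page_size=2):
    """Stand-in for a paged query over one token range; yields lists of rows."""
    lo, hi = token_range
    rows = [(tok, "value-%d" % tok) for tok in FAKE_TABLE if lo < tok <= hi]
    for i in range(0, len(rows), page_size):
        yield rows[i:i + page_size]


def format_page(rows):
    """Format one page of rows as CSV text inside the worker."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


def worker(ranges, out_queue):
    """Export every assigned range, sending formatted pages to the parent."""
    for token_range in ranges:
        for page in fetch_pages(token_range):
            out_queue.put(format_page(page))
    out_queue.put(None)  # sentinel: this worker has finished


def export(ring_tokens, out_file, num_workers=2):
    """Parent process: hand out ranges, then write pages as they arrive."""
    ranges = split_ring(ring_tokens)
    queue = mp.Queue()
    procs = [
        mp.Process(target=worker, args=(ranges[i::num_workers], queue))
        for i in range(num_workers)
    ]
    for proc in procs:
        proc.start()
    finished = 0
    while finished < num_workers:
        chunk = queue.get()
        if chunk is None:
            finished += 1
        else:
            out_file.write(chunk)  # only the parent touches the output file
    for proc in procs:
        proc.join()


if __name__ == "__main__":
    # Row order depends on worker scheduling, so the output is unordered.
    export([-1000, 0, 1000], sys.stdout)
```

Because pages are written in whatever order the workers deliver them, the rows of different token
ranges interleave, which is why the resulting CSV is unordered.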

The resulting CSV is unordered, so the dtests needed to be changed; see https://github.com/dkua/cassandra-dtest/tree/bulk_export
The changes have also been submitted to the dtest repo on GitHub as a PR.

-----

A small benchmark was run on a table of 10M rows inside a Vagrant box with 8 cores. The
table was created using the following command:
`tools/bin/cassandra-stress write n=10M -rate threads=50`

The original single-process version took about 30 minutes to export the table.
The multi-process version took about 7 minutes.


was (Author: dkua):
https://github.com/dkua/cassandra/tree/9304

The branch above contains my improvements to COPY TO. The approach is to work out the token
ranges from the token ring, start some subprocesses, and give each subprocess a subset of the
ranges. The subprocesses perform their queries asynchronously and pass each formatted page
back to the parent process, which writes it to the CSV file.

The resulting CSV is unordered, so the dtests needed to be changed. The changes have been
submitted to the dtest repo on GitHub as a PR.

-----

A small benchmark was run on a table of 10M rows inside a Vagrant box with 8 cores. The
table was created using the following command:
`tools/bin/cassandra-stress write n=10M -rate threads=50`

The original single-process version took about 30 minutes to export the table.
The multi-process version took about 7 minutes.

> COPY TO improvements
> --------------------
>
>                 Key: CASSANDRA-9304
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: David Kua
>            Priority: Minor
>              Labels: cqlsh
>             Fix For: 2.1.x
>
>
> COPY FROM has gotten a lot of love.  COPY TO not so much.  One obvious improvement could
> be to parallelize reading and writing (write one page of data while fetching the next).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
