cassandra-user mailing list archives

From Stefania Alborghetti <stefania.alborghe...@datastax.com>
Subject Re: HELP with bulk loading
Date Fri, 10 Mar 2017 01:09:48 GMT
When I tested cqlsh COPY FROM for CASSANDRA-11053
<https://issues.apache.org/jira/browse/CASSANDRA-11053?focusedCommentId=15162800&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15162800>,
I was able to import about 20 GB in under 4 minutes on a cluster with 8
nodes, using the same benchmark that was created for cassandra-loader,
provided the driver was Cythonized (see the instructions in this blog post
<http://www.datastax.com/dev/blog/six-parameters-affecting-cqlsh-copy-from-performance>).
The performance was similar to that of cassandra-loader.
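
As a rough back-of-the-envelope check (not part of the original benchmark),
the figures above can be turned into a throughput and compared with the
500 GB in 10 hours target from the original question. The sketch below
assumes decimal units (1 GB = 1000 MB) and treats "under 4 minutes" as
exactly 4 minutes, so the benchmark rate is a lower bound. It says nothing
about how a 3-node cluster with a different schema would behave.

```python
def throughput_mb_per_s(gigabytes, seconds):
    """Average throughput in MB/s, assuming 1 GB = 1000 MB."""
    return gigabytes * 1000.0 / seconds


# Benchmark quoted above: about 20 GB in (under) 4 minutes on 8 nodes.
benchmark_rate = throughput_mb_per_s(20, 4 * 60)

# Target from the original question: ~500 GB within 10 hours.
target_rate = throughput_mb_per_s(500, 10 * 3600)

# Hypothetical: time for 500 GB if the benchmark rate carried over unchanged.
hours_at_benchmark_rate = 500 * 1000.0 / benchmark_rate / 3600

if __name__ == "__main__":
    print("benchmark rate: %.1f MB/s" % benchmark_rate)
    print("target rate:    %.1f MB/s" % target_rate)
    print("500 GB at benchmark rate: %.2f hours" % hours_at_benchmark_rate)
```

The gap between the two rates is only an indication of headroom; the actual
rate depends on the cluster size, the schema and the client machines.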

Depending on your schema, one or the other may do slightly better.
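
For illustration only, here is a minimal Python sketch of how a COPY FROM
statement could be assembled and handed to cqlsh for one CSV file. The
keyspace, table, column and file names are hypothetical, and the option
values are placeholders rather than recommendations; NUMPROCESSES,
CHUNKSIZE and HEADER are existing COPY FROM options, but suitable values
depend on your schema and hardware.

```python
import subprocess


def build_copy_from(table, columns, csv_path, **options):
    """Build a cqlsh COPY FROM statement (table and columns are hypothetical)."""
    stmt = "COPY {} ({}) FROM '{}'".format(table, ", ".join(columns), csv_path)
    if options:
        rendered = []
        for name, value in sorted(options.items()):
            # Render Python booleans as lowercase true/false.
            if isinstance(value, bool):
                value = str(value).lower()
            rendered.append("{} = {}".format(name.upper(), value))
        stmt += " WITH " + " AND ".join(rendered)
    return stmt


def run_copy(host, statement):
    """Run the statement through cqlsh; requires cqlsh on the PATH."""
    return subprocess.call(["cqlsh", host, "-e", statement])


if __name__ == "__main__":
    # Placeholder values, assumed for the example only.
    statement = build_copy_from(
        "ks.events", ["id", "ts", "payload"], "part-0001.csv",
        header=True, numprocesses=8, chunksize=5000,
    )
    print(statement)  # pass to run_copy("127.0.0.1", statement) to execute
```

Splitting the input into several files and running one such command per
client machine is one way to spread the load, assuming the clients rather
than the cluster are the bottleneck.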

On Fri, Mar 10, 2017 at 8:11 AM, Ryan Svihla <rs@foundev.pro> wrote:

> I suggest using cassandra-loader:
>
> https://github.com/brianmhess/cassandra-loader
>
> On Mar 9, 2017 5:30 PM, "Artur R" <artur@gpnxgroup.com> wrote:
>
>> Hello all!
>>
>> I have ~500 GB of CSV files and I am trying to find a way to load them
>> into a C* table (new, empty C* cluster of 3 nodes, replication factor 2)
>> within a reasonable time (say, 10 hours using 3-4 c3.8xlarge EC2
>> instances).
>>
>> My first impulse was to use CQLSSTableWriter, but it is too slow on a
>> single instance and I can't parallelize it efficiently (just by creating
>> Java threads) because after a while it always "hangs" (it looks like GC is
>> overstressed) and eats all available memory.
>>
>> So the questions are:
>> 1. What is the best way to bulk-load a huge amount of data into a new C*
>> cluster?
>>
>> This comment on CASSANDRA-9323
>> <https://issues.apache.org/jira/browse/CASSANDRA-9323>:
>>
>>> The preferred way to bulk load is now COPY; see CASSANDRA-11053
>>> <https://issues.apache.org/jira/browse/CASSANDRA-11053> and linked
>>> tickets
>>
>>
>> is confusing, because I have read that CQLSSTableWriter + sstableloader is
>> much faster than COPY. Who is right?
>>
>> 2. Are there any real examples of multi-threaded use of CQLSSTableWriter?
>> Maybe ready-to-use libraries like https://github.com/spotify/hdfs2cass ?
>>
>> 3. sstableloader is slow too. Assuming that I have a new, empty C* cluster,
>> how can I improve the upload speed? Maybe disable replication or some other
>> settings while streaming and then turn them back on?
>>
>> Thanks!
>> Artur.
>>
>


-- 

STEFANIA ALBORGHETTI

Software engineer | +852 6114 9265 | stefania.alborghetti@datastax.com