cassandra-user mailing list archives

From: Artur R <>
Subject: HELP with bulk loading
Date: Thu, 09 Mar 2017 23:01:44 GMT
Hello all!

There are ~500 GB of CSV files, and I am trying to find a way to upload
them to a C* table (a new, empty C* cluster of 3 nodes, replication factor 2)
within a reasonable time (say, 10 hours, using 3-4 c3.8xlarge EC2 instances).

My first impulse was to use CQLSSTableWriter, but a single instance is too
slow, and I can't efficiently parallelize it (just by creating Java
threads): after some point it always "hangs" (it looks like the GC is
overstressed) and eats all available memory.

So the questions are:
1. What is the best way to bulk-load a huge amount of data into a new C* cluster?

This comment:

> The preferred way to bulk load is now COPY; see CASSANDRA-11053
> <> and linked tickets

is confusing, because I have read that CQLSSTableWriter + sstableloader is
much faster than COPY. Who is right?
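For reference, this is roughly the kind of COPY invocation I have been
testing (keyspace, table, columns, and file path are placeholders for my
actual schema; the WITH options are the cqlsh bulk-load tuning knobs and
the values are just a starting point, not recommendations):

```shell
# Sketch of a tuned cqlsh COPY FROM; placeholder names throughout.
# NUMPROCESSES controls the number of worker processes, CHUNKSIZE the
# rows sent to each worker per chunk, INGESTRATE caps rows/second.
cqlsh 10.0.0.1 -e "
  COPY mykeyspace.mytable (id, col1, col2)
  FROM '/data/csv/part1.csv'
  WITH HEADER = true
   AND NUMPROCESSES = 16
   AND CHUNKSIZE = 5000
   AND INGESTRATE = 100000;"
```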

2. Are there any real examples of multi-threaded use of CQLSSTableWriter?
Maybe ready-to-use libraries like:

3. sstableloader is slow too. Assuming that I have a new, empty C* cluster,
how can I improve the upload speed? Maybe disable replication or some other
settings while streaming and then turn them back on?
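For context, this is roughly how I am invoking sstableloader now (hosts and
the SSTable directory are placeholders; --connections-per-host is the only
tuning flag I have tried so far):

```shell
# Sketch of an sstableloader run; placeholder hosts and path.
# -d lists initial contact points; --connections-per-host raises the
# number of parallel stream connections to each node.
sstableloader \
  -d 10.0.0.1,10.0.0.2,10.0.0.3 \
  --connections-per-host 8 \
  /data/sstables/mykeyspace/mytable
```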

