cassandra-user mailing list archives

From Ahmed Eljami <ahmed.elj...@gmail.com>
Subject Re: HELP with bulk loading
Date Fri, 10 Mar 2017 13:22:52 GMT
Hi,

> 3. sstableloader is slow too. Assuming that I have a new, empty C* cluster,
> how can I improve the upload speed? Maybe disable replication or some other
> settings while streaming and then turn them back on?

Maybe you can speed up your load with the -cph option (connections per
host, see https://issues.apache.org/jira/browse/CASSANDRA-3668) and -t=1000.

With cph=12 and t=1000, I went from 56 min (default values) to 11 min for a
table of 50 GB.
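
For reference, here is a minimal sketch of how such an invocation could be assembled from a script. The helper name, the node addresses and the SSTable directory are hypothetical placeholders; the only flags used are -d (contact nodes), -cph (connections per host) and -t (throttle). The throttle unit and the exact flag syntax can differ between Cassandra versions, so check `sstableloader --help` on your installation before relying on it.

```python
import shlex


def build_sstableloader_cmd(nodes, sstable_dir,
                            connections_per_host=12, throttle=1000):
    """Return an sstableloader command line as a list of arguments.

    The defaults (12 and 1000) mirror the values quoted in this message.
    Everything else (node addresses, directory) is a placeholder.
    """
    if not nodes:
        raise ValueError("at least one contact node is required")
    return [
        "sstableloader",
        "-d", ",".join(nodes),             # initial contact nodes
        "-cph", str(connections_per_host), # connections per host
        "-t", str(throttle),               # streaming throttle
        sstable_dir,                       # .../<keyspace>/<table> directory
    ]


if __name__ == "__main__":
    # Hypothetical addresses and path, for illustration only.
    cmd = build_sstableloader_cmd(
        ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
        "/data/sstables/my_keyspace/my_table",
    )
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(cmd))
```

The script only prints the command; passing the same list to `subprocess.run` would execute it on a machine where sstableloader is installed.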

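To put the two timings in perspective, here is a small back-of-the-envelope calculation. It assumes decimal gigabytes (1 GB = 1000 MB) and, more importantly, that the rate measured on the 50 GB table would scale linearly to the ~500 GB mentioned in the original question. That is only an assumption; real throughput depends on the schema, the hardware and the cluster load.

```python
def throughput_mb_per_s(size_gb, minutes):
    """Average throughput in MB/s, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / (minutes * 60)


def projected_minutes(total_gb, sample_gb, sample_minutes):
    """Linear extrapolation of a measured load time to a larger data set."""
    return total_gb / sample_gb * sample_minutes


if __name__ == "__main__":
    for label, minutes in (("default settings", 56), ("cph=12, t=1000", 11)):
        rate = throughput_mb_per_s(50, minutes)
        eta = projected_minutes(500, 50, minutes)
        print(f"{label}: {rate:.1f} MB/s, ~{eta:.0f} min for 500 GB")
```

The ratio 56/11 is roughly a fivefold speedup, and under the linear assumption both settings would stay inside the 10-hour budget from the original question.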


2017-03-10 2:09 GMT+01:00 Stefania Alborghetti <
stefania.alborghetti@datastax.com>:

> When I tested cqlsh COPY FROM for CASSANDRA-11053
> <https://issues.apache.org/jira/browse/CASSANDRA-11053?focusedCommentId=15162800&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15162800>,
> I was able to import about 20 GB in under 4 minutes on a cluster with 8
> nodes using the same benchmark created for cassandra-loader, provided the
> driver was Cythonized, instructions in this blog post
> <http://www.datastax.com/dev/blog/six-parameters-affecting-cqlsh-copy-from-performance>.
> The performance was similar to cassandra-loader.
>
> Depending on your schema, one or the other may do slightly better.
>
> On Fri, Mar 10, 2017 at 8:11 AM, Ryan Svihla <rs@foundev.pro> wrote:
>
>> I suggest using cassandra loader
>>
>> https://github.com/brianmhess/cassandra-loader
>>
>> On Mar 9, 2017 5:30 PM, "Artur R" <artur@gpnxgroup.com> wrote:
>>
>>> Hello all!
>>>
>>> There are ~500 GB of CSV files and I am trying to find a way to
>>> upload them to a C* table (new, empty C* cluster of 3 nodes, replication
>>> factor 2) within a reasonable time (say, 10 hours using 3-4 instances of
>>> c3.8xlarge EC2 nodes).
>>>
>>> My first impulse was to use CQLSSTableWriter, but it is too slow on a
>>> single instance and I can't efficiently parallelize it (just creating Java
>>> threads) because after some time it always "hangs" (looks like GC is
>>> overstressed) and eats all available memory.
>>>
>>> So the questions are:
>>> 1. What is the best way to bulk-load a huge amount of data into a new C*
>>> cluster?
>>>
>>> This comment here: https://issues.apache.org/jira/browse/CASSANDRA-9323:
>>>
>>>> The preferred way to bulk load is now COPY; see CASSANDRA-11053
>>>> <https://issues.apache.org/jira/browse/CASSANDRA-11053> and linked
>>>> tickets
>>>
>>>
>>> is confusing because I read that CQLSSTableWriter + sstableloader is
>>> much faster than COPY. Who is right?
>>>
>>> 2. Are there any real examples of multi-threaded use of
>>> CQLSSTableWriter?
>>> Maybe ready to use libraries like: https://github.com/spotify/hdfs2cass
>>> ?
>>>
>>> 3. sstableloader is slow too. Assuming that I have a new, empty C* cluster,
>>> how can I improve the upload speed? Maybe disable replication or some other
>>> settings while streaming and then turn them back on?
>>>
>>> Thanks!
>>> Artur.
>>>
>>
>
>
> --
>
> STEFANIA ALBORGHETTI
>
> Software engineer | +852 6114 9265 |
> stefania.alborghetti@datastax.com
>


-- 
Regards,

Ahmed ELJAMI
