cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From p..@xvalheru.org
Subject Re: loading big amount of data to Cassandra
Date Sat, 03 Aug 2019 10:06:25 GMT
Thanks to all,

I'll try the SSTables.

Thanks

Pat

On 2019-08-03 09:54, Dimo Velev wrote:
> Check out the CQLSSTableWriter java class -
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java
> . You use it to generate sstables - you need to write a small program
> for that. You can then stream them over the network using the
> sstableloader (either use the utility or use the underlying classes to
> embed it in your program).
> 
> On 3. Aug 2019, at 07:17, Ayub M <hiayub@gmail.com> wrote:
> 
>> Dimo, how do you generate sstables? Do you mean load data locally on
>> a cassandra node and use sstableloader?
>> 
>> On Fri, Aug 2, 2019, 5:48 PM Dimo Velev <dimo.velev@gmail.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> Batches will actually slow down the process because they mean a
>>> different thing in C* - as you read they are just grouping changes
>>> together that you want executed atomically.
>>> 
>>> Cassandra does not really have indices so that is different than a
>>> relational DB. However, after writing stuff to Cassandra it
>>> generates many smallish partitions of the data. These are then
>>> joined in the background together to improve read performance.
>>> 
>>> You have two options from my experience:
>>> 
>>> Option 1: use normal CQL api in async mode. This will create a
>>> high CPU load on your cluster. Depending on whether that is fine
>>> for you that might be the easiest solution.
>>> 
>>> Option 2: generate sstables locally and use the sstableloader to
>>> upload them into the cluster. The streaming does not generate high
>>> cpu load so it is a viable option for clusters with other
>>> operational load.
>>> 
>>> Option 2 scales with the number of cores of the machine generating
>>> the sstables. If you can split your data you can generate sstables
>>> on multiple machines. In contrast, option 1 scales with your
>>> cluster. If you have a large cluster that is idling, it would be
>>> better to use option 1.
>>> 
>>> With both options I was able to write at about 50-100K rows / sec
>>> on my laptop and local Cassandra. The speed heavily depends on the
>>> size of your rows.
>>> 
>>> Back to your question — I guess option2 is similar to what you
>>> are used to from tools like sqlloader for relational DBMSes
>>> 
>>> I had a requirement of loading a few 100 mio rows per day into an
>>> operational cluster so I went with option 2 to offload the cpu
>>> load to reduce impact on the reading side during the loads.
>>> 
>>> Cheers,
>>> Dimo
>>> 
>>> Sent from my iPad
>>> 
>>>> On 2. Aug 2019, at 18:59, pat@xvalheru.org wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I need to upload to Cassandra about 7 billions of records. What
>>> is the best setup of Cassandra for this task? Will usage of batch
>>> speeds up the upload (I've read somewhere that batch in Cassandra
>>> is dedicated to atomicity not to speeding up communication)? How
>>> Cassandra internally works related to indexing? In SQL databases
>>> when uploading such amount of data is suggested to turn off
>>> indexing and then turn on. Is something simmillar possible in
>>> Cassandra?
>>>> 
>>>> Thanks for all suggestions.
>>>> 
>>>> Pat
>>>> 
>>>> ----------------------------------------
>>>> Freehosting PIPNI - http://www.pipni.cz/
>>>> 
>>>> 
>>>> 
>>> 
>> 
> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>>>> For additional commands, e-mail: user-help@cassandra.apache.org
>>>> 
>>> 
>>> 
>> 
> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>>> For additional commands, e-mail: user-help@cassandra.apache.org
> 
> ---------------------------------------------------------------------------
> 
> Freehosting PIPNI - http://www.pipni.cz/

----------------------------------------
Freehosting PIPNI - http://www.pipni.cz/


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org


Mime
View raw message