cassandra-user mailing list archives

From Jeff Jirsa <jeff.ji...@crowdstrike.com>
Subject Re: Any Bulk Load on Large Data Set Advice?
Date Thu, 17 Nov 2016 17:40:20 GMT
Other people are commenting on the appropriateness of Cassandra – they may have a point you
should consider, but I’m going to answer the question. 

 

1) Yes, you can generate the sstables in parallel.

2) If you use the sstable bulk loader interface (sstableloader), it’ll stream to all appropriate
nodes. You can run sstableloader from multiple nodes at the same time as well.

3) Sorting by partition key probably won’t hurt. If you run jobs in parallel, dividing
them up by partition key seems like a good way to parallelize your task.
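
As a rough illustration of point 3, here is a minimal Python sketch of splitting input records into disjoint work units by partition key, so that each parallel job generates sstables for its own set of partitions. It assumes the local (source) IP is the partition key, which the thread does not state, and the field name `local_ip` and both function names are hypothetical. The MD5-based hash is only a stable way to divide the work; it is not Cassandra's partitioner.

```python
import hashlib
from collections import defaultdict


def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index in [0, num_shards).

    Uses MD5 purely as a stable hash for dividing work between jobs.
    This is NOT the token Cassandra would compute for the key.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def split_by_partition_key(records, num_shards):
    """Group records so every row of a given partition lands in one shard.

    `records` is an iterable of dicts; the "local_ip" field is an
    assumed partition key chosen for illustration.
    """
    shards = defaultdict(list)
    for record in records:
        shards[shard_for_key(record["local_ip"], num_shards)].append(record)
    return shards


if __name__ == "__main__":
    sample = [
        {"local_ip": "10.0.0.1", "remote_ip": "192.0.2.10"},
        {"local_ip": "10.0.0.2", "remote_ip": "192.0.2.11"},
        {"local_ip": "10.0.0.1", "remote_ip": "192.0.2.12"},
    ]
    for shard_id, rows in sorted(split_by_partition_key(sample, 4).items()):
        print(shard_id, len(rows))
```

Each shard could then be handed to a separate sstable-generation job that writes into its own output directory, which keeps the jobs from touching the same partitions.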

 

We do something like this in certain parts of our workflow, and it works well.  
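
To make point 2 concrete, the following Python sketch builds one sstableloader command line per generated sstable directory. It is an illustration only, not the workflow mentioned above: the host addresses, directory paths, and the helper name are hypothetical. It relies on sstableloader's `-d` option for the initial contact hosts and on the directory path ending in `<keyspace>/<table>`.

```python
def sstableloader_commands(contact_points, sstable_dirs):
    """Return one sstableloader argument list per sstable directory.

    contact_points: initial hosts passed to `-d` (hypothetical addresses).
    sstable_dirs: directories whose paths end in <keyspace>/<table>.
    The commands are only built here, not executed.
    """
    hosts = ",".join(contact_points)
    return [["sstableloader", "-d", hosts, directory] for directory in sstable_dirs]


if __name__ == "__main__":
    commands = sstableloader_commands(
        ["10.1.0.1", "10.1.0.2"],
        ["/data/job0/netflow/flows", "/data/job1/netflow/flows"],
    )
    for command in commands:
        print(" ".join(command))
```

Each argument list could be passed to `subprocess.run` on whichever machine holds that directory, so several loaders stream at the same time.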

 

 

 

From: Joe Olson <technology@nododos.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, November 17, 2016 at 5:58 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Any Bulk Load on Large Data Set Advice?

 

I received a grant to do some analysis on netflow data (Local IP address, Local Port, Remote
IP address, Remote Port, time, # of packets, etc.) using Cassandra and Spark. The de-normalized
data set is about 13TB out the door. I plan on using 9 Cassandra nodes (replication factor=3)
to store the data, with Spark doing the aggregation. 

 

The data set will be immutable once loaded, and I am using replication factor = 3 to somewhat
simulate the real world. Most of the analysis will be of the sort "Give me all the remote
IP addresses for source IP 'X' between time t1 and t2."

 

I built and tested a bulk loader following this example in GitHub: https://github.com/yukim/cassandra-bulkload-example
to generate the SSTables, but I have not executed it on the entire data set yet.

 

Any advice on how to execute the bulk load under this configuration?  Can I generate the SSTables
in parallel? Once generated, can I write the SSTables to all nodes simultaneously? Should
I be doing any kind of sorting by the partition key?

 

This is a lot of data, so I figured I'd ask before I pulled the trigger. Thanks in advance!

 

 

