cassandra-user mailing list archives

From Joe Olson
Subject Any Bulk Load on Large Data Set Advice?
Date Thu, 17 Nov 2016 13:58:00 GMT
I received a grant to do some analysis on netflow data (local IP address, local port, remote
IP address, remote port, time, number of packets, etc.) using Cassandra and Spark. The denormalized
data set is about 13 TB out the door. I plan to use 9 Cassandra nodes (replication factor = 3)
to store the data, with Spark doing the aggregation.

The data set will be immutable once loaded, and I am using replication factor = 3 to somewhat
simulate the real world. Most of the analysis will be of the form "Give me all the remote
IP addresses for source IP 'X' between time t1 and t2".
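
To make that query shape concrete, here is a minimal in-memory sketch in Python. It is not Cassandra code and not a settled schema: the one-day bucket, the `FlowStore` class, and its method names are all hypothetical, chosen only to illustrate grouping rows by (local IP, day) and keeping them ordered by time within each group.

```python
import bisect
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def day_bucket(ts):
    """Hypothetical bucketing rule: one bucket per calendar day."""
    return ts.strftime("%Y-%m-%d")


class FlowStore:
    """Toy stand-in for a table keyed by (local_ip, day), ordered by time."""

    def __init__(self):
        # (local_ip, day) -> list of (timestamp, remote_ip), kept sorted
        self.partitions = defaultdict(list)

    def insert(self, local_ip, ts, remote_ip):
        key = (local_ip, day_bucket(ts))
        bisect.insort(self.partitions[key], (ts, remote_ip))

    def remote_ips(self, local_ip, t1, t2):
        """All remote IPs seen for local_ip with t1 <= time <= t2."""
        found = set()
        day = t1.replace(hour=0, minute=0, second=0, microsecond=0)
        while day <= t2:
            rows = self.partitions.get((local_ip, day_bucket(day)), [])
            # First row whose timestamp is >= t1 within this bucket
            start = bisect.bisect_left(rows, (t1,))
            for ts, remote_ip in rows[start:]:
                if ts > t2:
                    break
                found.add(remote_ip)
            day += timedelta(days=1)
        return found


def utc(*args):
    return datetime(*args, tzinfo=timezone.utc)


store = FlowStore()
store.insert("10.0.0.1", utc(2016, 11, 16, 23, 50), "8.8.8.8")
store.insert("10.0.0.1", utc(2016, 11, 17, 0, 10), "1.1.1.1")
store.insert("10.0.0.1", utc(2016, 11, 17, 12, 0), "9.9.9.9")
store.insert("10.0.0.2", utc(2016, 11, 17, 0, 5), "4.4.4.4")

result = store.remote_ips("10.0.0.1", utc(2016, 11, 16, 23, 0), utc(2016, 11, 17, 1, 0))
print(sorted(result))
```

With this layout a time-range query only touches the buckets that overlap [t1, t2]. The bucket size here is an arbitrary choice for the illustration.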

I built and tested a bulk loader following this example in GitHub:
to generate the SSTables, but I have not executed it on the entire data set yet. 

Any advice on how to execute the bulk load under this configuration? Can I generate the SSTables
in parallel? Once generated, can I write the SSTables to all nodes simultaneously? Should
I be doing any kind of sorting by the partition key? 
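
For concreteness, the kind of pre-processing these questions have in mind can be sketched as below. This is only an illustration under stated assumptions: the record fields (`local_ip`, `time`), the function names, and the use of an MD5-based hash are hypothetical, and the hash is a stand-in that does not match Cassandra's own partitioner. The idea is to split the input so that each worker receives a disjoint set of partition keys, sorted by key and then by time.

```python
import hashlib
from collections import defaultdict


def shard_for(partition_key, num_shards):
    """Stable shard number for a key (MD5 stand-in, not Cassandra's partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def split_into_shards(records, num_shards):
    """Group records so each partition key lands in exactly one shard,
    then sort every shard by (partition key, time)."""
    shards = defaultdict(list)
    for rec in records:
        shards[shard_for(rec["local_ip"], num_shards)].append(rec)
    for rows in shards.values():
        rows.sort(key=lambda r: (r["local_ip"], r["time"]))
    return dict(shards)


records = [
    {"local_ip": "10.0.0.2", "time": 30, "remote_ip": "4.4.4.4"},
    {"local_ip": "10.0.0.1", "time": 20, "remote_ip": "1.1.1.1"},
    {"local_ip": "10.0.0.1", "time": 10, "remote_ip": "8.8.8.8"},
    {"local_ip": "10.0.0.3", "time": 5, "remote_ip": "9.9.9.9"},
]
shards = split_into_shards(records, num_shards=3)
for shard_id in sorted(shards):
    print(shard_id, [(r["local_ip"], r["time"]) for r in shards[shard_id]])
```

Each shard could then be handed to a separate writer process with its own output directory. Whether this split and sort actually helps the load is part of what I am asking.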

This is a lot of data, so I figured I'd ask before I pulled the trigger. Thanks in advance!
