cassandra-user mailing list archives

From Ben Bromhead <>
Subject Re: Any Bulk Load on Large Data Set Advice?
Date Thu, 17 Nov 2016 17:25:02 GMT
+1 on parquet and S3.

Combined with Spark running on spot instances, your grant money will go much further.

On Thu, 17 Nov 2016 at 07:21 Jonathan Haddad <> wrote:

> If you're only doing this for Spark, you'll be much better off using
> parquet and HDFS or S3. While you *can* do analytics with Cassandra, it's
> not all that great at it.
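The win Jonathan describes comes largely from partition pruning: if the parquet files on S3 are laid out by the query keys (e.g. via Spark's `partitionBy`), the reader skips every path that cannot match the predicate before touching any data. A stdlib-only sketch of that pruning idea, with a hypothetical bucket layout (key names and dates are assumptions):

```python
from datetime import date

# Hypothetical S3 key layout, as written by partitionBy("local_ip", "day"):
keys = [
    "netflow/local_ip=10.0.0.1/day=2016-11-01/part-0000.parquet",
    "netflow/local_ip=10.0.0.1/day=2016-11-02/part-0000.parquet",
    "netflow/local_ip=10.0.0.2/day=2016-11-01/part-0000.parquet",
]

def prune(keys, local_ip, t1, t2):
    """Keep only keys whose partition values satisfy the predicate,
    mimicking what a parquet reader does before reading any file."""
    out = []
    for k in keys:
        # Segments 1 and 2 of the key hold the partition columns.
        parts = dict(seg.split("=") for seg in k.split("/")[1:3])
        if parts["local_ip"] == local_ip and \
                t1 <= date.fromisoformat(parts["day"]) <= t2:
            out.append(k)
    return out

hits = prune(keys, "10.0.0.1", date(2016, 11, 1), date(2016, 11, 1))
```

Only one of the three files survives the predicate, so only one file would be fetched from S3.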
> On Thu, Nov 17, 2016 at 6:05 AM Joe Olson <> wrote:
> I received a grant to do some analysis on netflow data (Local IP address,
> Local Port, Remote IP address, Remote Port, time, # of packets, etc) using
> Cassandra and Spark. The de-normalized data set is about 13TB out the door.
> I plan on using 9 Cassandra nodes (replication factor=3) to store the data,
> with Spark doing the aggregation.
> The data set will be immutable once loaded, and I am using replication
> factor = 3 to somewhat simulate the real world. Most of the analysis will
> be of the sort "Give me all the remote IP addresses for source IP 'X'
> between time t1 and t2".
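For that quoted access pattern, a Cassandra table usually puts the local IP in the partition key, adds a time bucket to keep partitions bounded at this data volume, and clusters on the timestamp. A hedged sketch only; the keyspace, table, and column names are assumptions, shown here as a CQL string:

```python
# Hypothetical schema for the netflow query pattern. The `day` bucket
# keeps any single partition from absorbing a full IP's 13 TB history.
schema = """
CREATE TABLE netflow.flows (
    local_ip    inet,
    day         date,       -- time bucket: bounds partition size
    event_time  timestamp,
    remote_ip   inet,
    local_port  int,
    remote_port int,
    packets     bigint,
    PRIMARY KEY ((local_ip, day), event_time)
) WITH CLUSTERING ORDER BY (event_time ASC);
"""

# The target query then becomes a single-partition range scan per day:
query = ("SELECT remote_ip FROM netflow.flows "
         "WHERE local_ip = ? AND day = ? "
         "AND event_time >= ? AND event_time < ?")
```

A query spanning t1..t2 fans out to one partition per day bucket, each of which is a cheap clustered range read.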
> I built and tested a bulk loader to generate the SSTables, following this
> example on GitHub: . I have not executed it on the entire data set yet.
> Any advice on how to execute the bulk load under this configuration?  Can
> I generate the SSTables in parallel? Once generated, can I write the
> SSTables to all nodes simultaneously? Should I be doing any kind of sorting
> by the partition key?
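On the parallelism questions: SSTable generation is CPU-bound and embarrassingly parallel, so one common approach is to shard the input, run one writer per shard, and then stream each finished output directory with `sstableloader`. A minimal stdlib sketch of that orchestration; `write_sstables`, the shard names, and the paths are all assumptions standing in for the CQLSSTableWriter-based generator from the linked example:

```python
from concurrent.futures import ThreadPoolExecutor

def write_sstables(shard_path: str, out_dir: str) -> str:
    """Hypothetical wrapper around the CQLSSTableWriter-based generator
    from the linked example. In practice this would launch one JVM
    process per input shard, so threads are enough to drive it, e.g.:
    subprocess.run(["java", "-jar", "bulkloader.jar", shard_path, out_dir])
    """
    return out_dir

# Split the 13 TB input into shards and generate SSTables concurrently.
shards = [f"netflow-{i:02d}.csv" for i in range(4)]
out_dirs = [f"sstables/shard-{i:02d}" for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    done = list(pool.map(write_sstables, shards, out_dirs))

# Each finished directory can then be streamed to the cluster, e.g.:
#   sstableloader -d node1,node2,node3 sstables/shard-00/<keyspace>/<table>
```

`sstableloader` streams each SSTable to every replica that owns its token range, so pre-sorting by partition key is not required for correctness, though it can reduce the number of output SSTables per shard.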
> This is a lot of data, so I figured I'd ask before I pulled the trigger.
> Thanks in advance!
> --
Ben Bromhead
CTO | Instaclustr <>
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer
