I am writing a hardware proposal for a Cassandra cluster that will store a large amount of monitoring time series data. My workload is write intensive, and my data set is extremely varied, both in the types of variables and in their insertion rates. I will have to handle on the order of 2 million variables, each arriving at a very different rate: the majority arrive at very low rates, but many arrive at higher, constant rates, and a few arrive with huge spikes. These variables correspond to all the basic C++ types and arrays of these types. The highest insertion rates are for the basic types, of which U32 variables seem to be the most prevalent. For example, I recorded 2 million U32 variables inserted in 8 minutes of operation, while 600,000 doubles and 170,000 strings were inserted during the same time. Note that this measurement covered only a subset of the total data currently taken in.
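To put that sample in per-second terms, here is a quick sketch of the arithmetic. It only uses the three counts from the 8-minute measurement above and treats each recorded item as one insert; the names (`observed_inserts`, `inserts_per_second`) are just illustrative, and the result is an average over the window, so it says nothing about the spikes.

```python
# Counts observed during one 8-minute window (subset of the full data set).
WINDOW_SECONDS = 8 * 60

observed_inserts = {
    "U32": 2_000_000,
    "double": 600_000,
    "string": 170_000,
}


def inserts_per_second(count, window_seconds=WINDOW_SECONDS):
    """Average insert rate over the measurement window."""
    return count / window_seconds


# Per-type average rates and the combined rate for the measured subset.
rates = {name: inserts_per_second(n) for name, n in observed_inserts.items()}
total_rate = sum(rates.values())

for name, rate in rates.items():
    print(f"{name:>6}: {rate:8.0f} inserts/s")
print(f" total: {total_rate:8.0f} inserts/s")
```

So the measured subset averages a bit under 6,000 inserts per second across the cluster, dominated by U32.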
At the moment I am partitioning the data in Cassandra into 75 CFs. Each CF corresponds to a logical grouping of the variables mentioned above, but this grouping is unrelated to data volume or insertion rate, so from a load point of view it is somewhat random. These 75 CFs account for ~1 million of the variables I need to store. I have a 3-node Cassandra 0.8.5 cluster; each node has 4 physical cores and 4 GB of RAM, with the commit log directory and the data file directory on two separate RAID arrays of HDDs. I can handle the load in this configuration, but the average CPU usage of the Cassandra nodes is slightly above 50%. As I will need to add 12 more CFs (corresponding to another ~1 million variables), plus potentially other data later, it is clear that I need better hardware (also for the retrieval part).
I am looking at Dell servers (PowerEdge etc.).
1. Is anyone using Dell HW for their Cassandra clusters? How do they behave? Anybody care to share their configurations or tips for buying, what to avoid etc?
2. Obviously I am going to follow the advice at http://wiki.apache.org/cassandra/CassandraHardware and put the commitlog and data on separate disks. I was going to use an SSD for the commitlog, but after some more research I found that it doesn't make sense to use SSDs for sequential appends, because they won't have a performance advantage over rotational media for that access pattern. So I am going to use a rotational disk for the commitlog and an SSD for data. Does this make sense?
3. What's the best way to find out how big my commitlog disk and my data disk have to be? The Cassandra hardware page says the commitlog disk shouldn't need to be big, but I still need to choose a size!
4. I also noticed that a RAID 0 configuration is recommended for the data file directory. Can anyone explain why?
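On question 3, this is the kind of back-of-the-envelope calculation I have in mind for the data disk. It is only a sketch under assumptions I have not verified: the bytes stored per insert, the retention period, the replication factor and the 2x compaction headroom are all placeholder values, and `data_disk_bytes_per_node` is just a name I made up. The only measured input is the insert rate, which comes from the 8-minute sample (2,770,000 inserts in 480 s, roughly 5,770 per second) and covers only a subset of the data.

```python
SECONDS_PER_DAY = 24 * 3600


def data_disk_bytes_per_node(inserts_per_second, bytes_per_insert,
                             retention_days, replication_factor,
                             node_count, compaction_headroom=2.0):
    """Rough per-node data disk estimate.

    Every argument is a guess to be replaced by real measurements;
    compaction_headroom=2.0 is an assumed rule of thumb, not a measured value.
    """
    raw_bytes = inserts_per_second * bytes_per_insert * retention_days * SECONDS_PER_DAY
    replicated_bytes = raw_bytes * replication_factor   # each row is stored on RF nodes
    with_headroom = replicated_bytes * compaction_headroom
    return with_headroom / node_count                   # assumes an evenly balanced ring


# Hypothetical inputs: only the insert rate is derived from the sample.
estimate = data_disk_bytes_per_node(
    inserts_per_second=5_770,   # average over the 8-minute sample
    bytes_per_insert=50,        # placeholder: on-disk size per column is unknown
    retention_days=365,         # placeholder retention period
    replication_factor=3,       # placeholder
    node_count=3,
)
print(f"approx. {estimate / 1e12:.1f} TB of data disk per node")
```

The useful part is less the number than seeing which inputs dominate: retention and bytes per insert scale the result linearly, so those are the two I most need to measure before picking a disk size.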
Sorry for the huge email.....