I am thinking of strategies to deploy my application, which uses a 3-node Cassandra cluster.
Quick recap: I have several client applications that feed about 2 million different variables (each representing a different monitoring value/channel) into Cassandra. The system receives updates for each of these monitoring values at different rates. For each new update, the timestamp and value are recorded in a Cassandra name-value pair. The Cassandra schema is built using one CF for data and 4 other CFs for metadata. The data CF uses rows as 4-hour time bins. The system can currently sustain the insertion load. Now I'm looking into retrieval performance for random queries and organizing the flow of data in and out of the cluster.
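To make the 4-hour time binning concrete, here is a minimal sketch in Python of how a row key could be derived from a channel and a timestamp. The key layout (channel id, a colon, then the bin start in epoch seconds) and the function names are assumptions for illustration only, not the actual schema.

```python
BIN_SECONDS = 4 * 60 * 60  # 4-hour time bins, as in the data CF


def bin_start(ts):
    """Return the epoch second at which the 4-hour bin containing ts begins."""
    return int(ts // BIN_SECONDS) * BIN_SECONDS


def row_key(channel_id, ts):
    """Hypothetical row key: channel id plus the start of its time bin."""
    return f"{channel_id}:{bin_start(ts)}"


def bins_in_range(start_ts, end_ts):
    """List the bin starts a reader must visit to cover [start_ts, end_ts]."""
    first = bin_start(start_ts)
    last = bin_start(end_ts)
    return list(range(first, last + BIN_SECONDS, BIN_SECONDS))
```

With a layout like this, a random read for one channel over a time window touches one row per bin returned by bins_in_range, so a query spanning a day would read at most 7 rows.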
The main concern at the moment is organizing the flow of data in and out of the cluster. Why do I need to move data out? Well, my requirement is to keep all the data coming into the system at the highest granularity for the long term (several years). The 3-node cluster I mentioned is the online cluster, which is supposed to be able to absorb the input load for a relatively short period of time, a few weeks. After this period the data has to be shipped out of the cluster to a mass storage facility and the cluster needs to be emptied to make room for more data. The online cluster will also serve reads while it takes in data.
One solution would be to stop the system every few weeks, export the data, truncate the CFs and then start taking data again. In a few weeks a lot of data will accumulate (hundreds of GBytes), which makes the two operations lengthy and error-prone. The problem is that the system cannot afford downtime. So I am looking for solutions that keep the online system taking data and serving reads without being affected too much by the export and the truncation.
As DataStax splits the cluster into an online and an offline part, I am thinking of having 2 nodes in one data center (DC_X) and the 3rd node in the other data center (DC_Y). The clients will be writing to all 3 nodes. Using a replication factor of 2, with one replica placed in each data center, ensures that a replica of everything stored on the nodes in DC_X will always be sent to the node in DC_Y. That means the cluster will be unbalanced, but that's fine because the node in DC_Y will contain all the data in the system. From time to time I can export the data from this node, which means that its performance will go down a lot. Will the system be able to sustain the exporting of all data from the node in DC_Y from time to time? After I finish exporting I will want to empty the data in the cluster. How about truncating the CFs? Can I truncate the CFs while the 3 nodes are in operation? Will this affect performance a lot? I know it's probably dependent on data size. How should I go about this?
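The reasoning behind the DC_X/DC_Y layout can be checked with a toy model in Python. This is only a sketch: it assumes one replica per data center and picks a node inside each data center by hashing the row key, which is not Cassandra's real placement algorithm. The node names x1, x2 and y1 are made up for illustration.

```python
import hashlib

# Assumed topology: two nodes in DC_X, one node in DC_Y (names are hypothetical).
NODES = {"DC_X": ["x1", "x2"], "DC_Y": ["y1"]}
# Assumed replication: total factor 2, one replica per data center.
REPLICAS_PER_DC = {"DC_X": 1, "DC_Y": 1}


def replicas_for(key):
    """Pick the nodes holding a row: the configured count from each data center."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    chosen = []
    for dc, count in REPLICAS_PER_DC.items():
        nodes = NODES[dc]
        for i in range(count):
            chosen.append(nodes[(h + i) % len(nodes)])
    return chosen


def load(keys):
    """Count how many rows each node ends up storing."""
    counts = {n: 0 for ns in NODES.values() for n in ns}
    for k in keys:
        for n in replicas_for(k):
            counts[n] += 1
    return counts
```

In this model the single DC_Y node stores every row, while the two DC_X nodes split the same rows between them, which is the imbalance described above. If both replicas were allowed to land in DC_X, the DC_Y node would no longer be guaranteed a full copy.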