I have an application that consists of many (possibly thousands of) measurement series, and each measurement series generates a small amount of data output (only about 500 bytes) every 10 seconds. This time series of data should be stored in Cassandra in such a way that read access is possible for a given time range.
My current data model uses two column families (CFs):
- first CF has key = measurement series ID, column name = timeuuid_of_output
- second CF has key = timeuuid_of_output, column value = data output (~ 500 bytes)
When someone requests a time range of data, I read the first CF, get a series of timeuuids, and then do a row multiget on the second CF.
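To make the read path concrete, here is a minimal in-memory sketch of the two-CF layout in Python. It is not Cassandra client code: plain dicts stand in for the column families, integer timestamps stand in for timeuuids, and the names `index_cf`, `data_cf`, `write_output` and `read_range` are made up for illustration only.

```python
from bisect import bisect_left, bisect_right

# Hypothetical stand-ins for the two column families (plain dicts, not Cassandra).
index_cf = {}  # first CF: series ID -> sorted list of output timestamps (column names)
data_cf = {}   # second CF: output key -> data output (~500 bytes)


def write_output(series_id, ts, payload):
    """Store one output: one column in the index row, one row in the data CF."""
    # Assumption: timestamps for a series arrive in increasing order.
    index_cf.setdefault(series_id, []).append(ts)
    # The (series_id, ts) tuple stands in for a globally unique timeuuid.
    data_cf[(series_id, ts)] = payload


def read_range(series_id, start, end):
    """Return (payloads, number of row lookups) for start <= ts <= end."""
    columns = index_cf.get(series_id, [])
    lo = bisect_left(columns, start)
    hi = bisect_right(columns, end)
    keys = [(series_id, ts) for ts in columns[lo:hi]]
    # The "multiget": one key lookup in the second CF per requested record.
    payloads = [data_cf[key] for key in keys]
    return payloads, len(keys)
```

In this model the number of lookups in the second CF grows linearly with the number of records in the requested time range.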
This works great, but tends to be slow for big series of data (let's say for 10 days, nearly 100,000 records will be requested from the second CF). This load of 100,000 reads will be distributed across the cluster (because the second CF scales very nicely with a RandomPartitioner), but more or less one ends up with 100,000 individual read requests; at least that's what I suspect.
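For reference, a quick back-of-the-envelope calculation behind that estimate, using only the 10-second interval and the roughly 500-byte output size mentioned above (the variable names are just for illustration):

```python
INTERVAL_S = 10      # one output every 10 seconds
OUTPUT_BYTES = 500   # approximate size of one output
DAYS = 10

records = DAYS * 24 * 60 * 60 // INTERVAL_S
total_bytes = records * OUTPUT_BYTES

print(records)            # 86400
print(total_bytes / 1e6)  # 43.2 (decimal megabytes)
```

So the payload for 10 days is only about 43 MB per series; what worries me is the number of individual reads rather than the data volume.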
Can anyone say whether there is a better data model for this type of query? Would it be a reasonable improvement to put all data into a single CF, like this:
- single CF, key = measurement series ID, column name = timeuuid_of_output, column value = data output
When I request a series of 100,000 columns from this row (now it's a single row), can the performance really be better? Is there any chance that Cassandra will be able to read this data "en bloc" from the hard drive?
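As a rough illustration of the single-CF idea, here is a minimal in-memory sketch in Python. Again, this is not Cassandra client code: a dict stands in for the CF, integer timestamps stand in for timeuuids, and `series_cf`, `append_column` and `read_slice` are hypothetical names. It only shows the access pattern (one row, one contiguous column slice) and says nothing about how Cassandra actually lays the row out on disk, which is exactly what I am unsure about.

```python
from bisect import bisect_left, bisect_right

# Hypothetical stand-in for the single CF:
# series ID -> (sorted column names, column values in the same order).
series_cf = {}


def append_column(series_id, ts, payload):
    """Add one column (name = timestamp, value = data output) to the series row."""
    names, values = series_cf.setdefault(series_id, ([], []))
    # Assumption: timestamps for a series arrive in increasing order.
    names.append(ts)
    values.append(payload)


def read_slice(series_id, start, end):
    """Return the values of all columns with start <= name <= end."""
    names, values = series_cf.get(series_id, ([], []))
    lo = bisect_left(names, start)
    hi = bisect_right(names, end)
    # One row lookup followed by one contiguous slice of its columns.
    return values[lo:hi]
```

Here a time-range request touches a single row and returns one contiguous run of columns, instead of one lookup per record.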
Any advice is appreciated...