That spreadsheet doesn't take compression into account, which is very important in my case. Uncompressed, my data is going to require a petabyte of storage according to the spreadsheet. I am pretty sure I won't get that much storage to play with.
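
As a rough sanity check on the gap, here is a back-of-envelope calculation. It is only a sketch: it assumes decimal units (1 PB = 10^15 bytes), takes the one trillion data points and the 1 byte per point target from the original question quoted below, and uses illustrative variable names.

```python
# Back-of-envelope estimate; decimal units are an assumption (1 PB = 10**15 bytes).
POINTS = 10**12                 # one trillion data points (from the original question)
SPREADSHEET_BYTES = 10**15      # about 1 PB uncompressed, per the spreadsheet
TARGET_BYTES_PER_POINT = 1      # compressed size quoted in the original question

bytes_per_point = SPREADSHEET_BYTES / POINTS
target_total = POINTS * TARGET_BYTES_PER_POINT

print(f"spreadsheet estimate: {bytes_per_point:.0f} bytes per data point")
print(f"target total: {target_total / 10**12:.0f} TB")
print(f"gap: {SPREADSHEET_BYTES / target_total:.0f}x")
```

Under these assumptions the uncompressed estimate works out to roughly three orders of magnitude above the compressed target.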

The spreadsheet also shows that Cassandra wastes an unbelievable amount of space on compaction. My experiments with LevelDB, however, show that it is possible for a write-optimized database to use negligible compaction space. I am not sure how LevelDB does it. My guess is that it splits the larger sstables into smaller chunks and merges them incrementally.

Anyway, does anybody know how densely I can store the data in Cassandra when compression is enabled? Would I have to implement some smart adaptive grouping to fit lots of records into one row, or is there a simpler solution?
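
To make "adaptive grouping" concrete, here is a minimal sketch of what I have in mind. It is plain Python with no Cassandra calls; the function name, the (timestamp, value) point shape, the `max_points` parameter, and the (series_id, first_timestamp) row key are all hypothetical choices for illustration. It assumes the points of each series arrive sorted by timestamp.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Point = Tuple[int, float]          # (timestamp, value); hypothetical shape
RowKey = Tuple[str, int]           # (series_id, first timestamp in the bucket)


def adaptive_buckets(
    series_id: str,
    points: Iterable[Point],
    max_points: int = 10_000,
) -> Iterator[Tuple[RowKey, List[Point]]]:
    """Split one time series into buckets of at most max_points points.

    Buckets are sized by point count, not by a fixed time window, so a
    dense series yields many buckets and a sparse one yields few.
    Assumes points are already sorted by timestamp.
    """
    it = iter(points)
    while True:
        chunk = list(islice(it, max_points))
        if not chunk:
            return
        yield (series_id, chunk[0][0]), chunk


# Usage: a series of 25 points with max_points=10 gives buckets of 10, 10 and 5.
dense = [(t, 0.0) for t in range(25)]
for key, chunk in adaptive_buckets("s1", dense, max_points=10):
    print(key, len(chunk))
```

Each bucket would then be compressed and written as one row (or one blob column). This is exactly the kind of application-level layer I would prefer to avoid if Cassandra offers something simpler.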

On 4 Oct 2013 at 1:56, Andrey Ilinykh wrote:
It may help.
https://docs.google.com/spreadsheet/ccc?key=0Atatq_AL3AJwdElwYVhTRk9KZF9WVmtDTDVhY0xPSmc#gid=0


On Thu, Oct 3, 2013 at 1:31 PM, Robert Važan <robert.vazan@gmail.com> wrote:
I need to store one trillion data points. The data is highly compressible down to 1 byte per data point using simple custom compression combined with standard dictionary compression. What's the most space-efficient way to store the data in Cassandra? How much per-row overhead is there if I store one data point per row?

The data is particularly hard to group. It's a large number of time series with highly variable density. That makes it hard to pack subsets of the data into meaningful column families / wide rows. Is there a table layout scheme that would allow me to approach 1 byte per data point without forcing me to implement a complex abstraction layer at the application level?