Hi All,
I am trying to understand the relationship between data set/SSTable(s) size and
Cassandra heap.
Q1. Here is the memory calculation from the Wiki:
"For a rough rule of thumb, Cassandra's internal datastructures will require
about memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal caches."
This formula does not depend on the data set size. Does this mean that, provided
Cassandra has sufficient disk space to accommodate a growing data set, it can
run in fixed memory during a bulk load? Am I right that the memory impact of
compacting ever-larger SSTables is capped by the
in_memory_compaction_limit_in_mb parameter?
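
To make my reading of the formula concrete, here is a minimal sketch in Python.
The function name and the sample values (64 MB memtable throughput, 4 hot CFs,
256 MB of caches) are made up for illustration and are not from a real cluster.
I am also assuming that "1G" means 1024 MB. Note that no term depends on the
size of the data on disk, which is what prompts the question above.

```python
def estimate_heap_mb(memtable_throughput_in_mb, hot_cfs, internal_caches_mb=0):
    """Rough heap estimate in MB, following the Wiki rule of thumb.

    All parameter names except memtable_throughput_in_mb are hypothetical.
    """
    fixed_overhead_mb = 1024  # the "1G" term, assumed here to be 1024 MB
    memtable_mb = memtable_throughput_in_mb * 3 * hot_cfs
    return memtable_mb + fixed_overhead_mb + internal_caches_mb


# Illustrative values only: 64 * 3 * 4 + 1024 + 256
print(estimate_heap_mb(64, 4, 256))  # prints 2048
```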
Q2. What would I need to monitor to predict in advance the need to double the
number of nodes, assuming sufficient storage per node? Is there a simple rule of
thumb saying that for a heap of size X a node can handle an SSTable of size Y? I
do realize that I/O and CPU play a role here, but could that be reduced to a
factor: Y = f(X) * z, where z is 1 for a specified server config? I am assuming
the random partitioner and a fixed number of write clients.
Q3. Does the formula account for deserialization during reads? What does the 1G
term represent?
Thank you very much,
Oleg
