On Wed, Jun 8, 2011 at 12:19 AM, AJ <aj@dude.podzone.net> wrote:
On 6/7/2011 9:32 PM, Edward Capriolo wrote:
<snip>

I do not like large-disk setups. I think they end up not being economical. Most low-latency use cases want a high RAM-to-disk ratio, and two machines with 32GB of RAM are usually less expensive than one machine with 64GB of RAM.

For a machine with a 1TB drive (or multiple 1TB drives), it is going to be difficult to get enough RAM to help with random read patterns.

Also, cluster operations like joining, decommissioning, or repair can take a *VERY* long time, maybe a day. More, smaller servers (blade style) are more agile.


Is there some rule of thumb as to how much RAM is needed per GB of data?  I know it probably "depends", but if you could explain as best you can, that would be great!  I too am projecting "big data" requirements.


The way this is normally explained is the active set. I.e., you have 100,000,000 users, but at any given time only 1,000,000 are active, so you need enough RAM to keep those users cached.

No, there is no rule of thumb; it depends on access patterns. In the most extreme case you are using Cassandra for an ETL workload. In that case your data will far exceed your RAM, and since most operations will be like a "full table scan", caching is almost hopeless and useless. On the other side, there are those who want every lookup to be predictable low latency with totally random reads, and those might want to maintain a 1:1 RAM-to-data ratio.
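To make the active-set idea concrete, here is a back-of-the-envelope sketch. The 2KB average row size and the 2x caching overhead factor are just assumptions for illustration; the user counts are the ones from the example above.

# Back-of-the-envelope RAM sizing for the active-set example above.
# Everything except the 100M/1M user counts is an assumed number.
total_users    = 100000000      # rows in the column family
active_users   = 1000000        # rows actually read in a given window
avg_row_size   = 2 * 1024       # assumed average row size, in bytes
cache_overhead = 2.0            # assumed factor for JVM/cache bookkeeping

# RAM needed to keep just the active set hot:
active_set_ram = active_users * avg_row_size * cache_overhead
print("active set: %.1f GB" % (active_set_ram / 1024.0 ** 3))

# RAM needed for a 1:1 RAM-to-data ratio (the fully random, low-latency case):
full_data_ram = total_users * avg_row_size
print("1:1 ratio:  %.1f GB" % (full_data_ram / 1024.0 ** 3))

With those assumptions the active set fits in roughly 4GB, while the 1:1 case needs close to 200GB, which is why you have to know your access pattern before you size hardware.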

I would track these things over time (a rough sketch of how to collect them follows the list):
reads/writes to c*
disk utilization
size of CF on disk
cache hit rate
latency
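
One rough way to collect those numbers is to snapshot the raw output of nodetool cfstats and iostat and diff the snapshots month over month. The sketch below does just that; the log path is made up, and the exact cfstats output format varies between Cassandra versions, so treat it as a starting point rather than a finished tool.

#!/usr/bin/env python
# Rough sketch: append a daily snapshot of the metrics above to a log file.
# Assumes nodetool and iostat (from sysstat) are on the PATH; the cfstats
# output format varies by Cassandra version, and the log path is made up.
import subprocess, time

def run(cmd):
    return subprocess.check_output(cmd, universal_newlines=True)

cfstats = run(["nodetool", "cfstats"])   # per-CF read/write counts, latency,
                                         # cache hit rates, space used on disk
iostat  = run(["iostat", "-dx"])         # per-device disk utilization

stamp = time.strftime("%Y-%m-%d")
with open("/var/log/cassandra-capacity.log", "a") as log:
    log.write("==== %s ====\n" % stamp)
    log.write(cfstats)
    log.write(iostat)

Run it out of cron once a day and the month-over-month comparison below falls out of grepping the log.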

And eventually you find what your ratio is. For example:

last month:
I had 30 reads/sec
my disk was 40% utilized
my column family was 40 GB
my cache hit rate was 70%
my latency was 1ms

this month:
I had 45 reads/sec
my disk was 95% utilized
my column family was 40 GB
my cache hit rate was 30%
my latency was 5ms

Conclusion:
my disk is maxed out and my cache hit rate is dropping. I probably need more nodes or more RAM.
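
Plugging those two snapshots into a quick, naive extrapolation shows why: reads grew 1.5x while the data stayed the same size, and the disk is already at 95%, so another month of the same growth has nowhere to go. A sketch of that arithmetic, using only the numbers from the example above:

# The two monthly snapshots from the example above.
last_month = dict(reads=30, disk_util=0.40, cache_hit=0.70, latency_ms=1)
this_month = dict(reads=45, disk_util=0.95, cache_hit=0.30, latency_ms=5)

read_growth = this_month["reads"] / float(last_month["reads"])   # 1.5x
# Naive projection: assume disk utilization scales with the read rate.
next_util = this_month["disk_util"] * read_growth
print("projected disk utilization next month: %.0f%%" % (next_util * 100))
# ~143%, i.e. past saturation -- hence more nodes or more RAM.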