We have 1.5 TB running smoothly, with index_interval: 1024 and an 8 GB JVM. Default bloom filters.
The only problem we have is that our 2 TB SSDs are almost full, and C* starts crashing. It looks like Cassandra considers that there is no more space available when there is still 500 GB free (you're not supposed to use more than 50% of the disk space).
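
To make the arithmetic behind that 50% guideline concrete, here is a minimal Python sketch. The function name and the 0.5 default are only illustrative assumptions, not Cassandra settings; the idea is simply that compaction can temporarily need free space comparable to the data it rewrites.

```python
def compaction_headroom_gb(disk_gb, data_gb, max_fill=0.5):
    """GB left before data exceeds max_fill of the disk (negative means over budget).

    max_fill=0.5 mirrors the "don't use more than 50%" rule of thumb;
    it is an assumption for illustration, not a value read from Cassandra.
    """
    return disk_gb * max_fill - data_gb


# Numbers from this message: 2 TB disk, 1.5 TB of data.
print(compaction_headroom_gb(2000, 1500))  # -500.0, i.e. 500 GB over the guideline
```

So by that rule of thumb the node is well past the budget even though the disk still shows 500 GB free.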

All operations are of course slower with these loads (bootstrap, repair, cleanup, ...).

Yet I read on the DataStax website that the max size is around 300-500 GB for C* < 1.2.x and 3 to 5 TB after that (under certain conditions, taking advantage of off-heap bloom filters, caches, etc.). Vnodes should also help reduce the time needed for some operations.

Currently we have 480-520 GB of data per node, so it's not even close to 1TB, but I'd bet that reaching 700-800GB shouldn't be a problem in terms of "everyday performance" - heap space is quite low, no GC issues etc. (to give you a comparison: when working on 1.1 and having ~300-400GB per node we had a huge problem with bloom filters and heap space, so we had to bump it to 12-16 GB; on 1.2 it's not an issue anymore).

However, our main concern is the time that we'll need to rebuild a broken node, so we are going to extend the cluster soon to avoid such problems and keep our nodes about 50% smaller.
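
As a rough illustration of why node size matters for rebuild time, here is a back-of-the-envelope sketch in Python. The 50 MB/s streaming rate is a purely hypothetical figure, not something measured on our cluster, and the estimate ignores compaction and any other work done after streaming.

```python
def rebuild_hours(data_gb, throughput_mb_per_s):
    """Lower-bound estimate of the hours needed to stream data_gb to a new node."""
    seconds = data_gb * 1024 / throughput_mb_per_s
    return seconds / 3600


# Hypothetical sustained streaming rate; replace with a measured value.
ASSUMED_MB_PER_S = 50

for size_gb in (250, 500, 1000):
    print(size_gb, "GB ->", round(rebuild_hours(size_gb, ASSUMED_MB_PER_S), 1), "h")
```

Under this simple model, halving the data per node halves the streaming time.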


That's what I thought. I have tried all the avenues, and will give ParNew a
try. With 1.0.xx I have issues when data sizes go up; hopefully that
will not be the case with 1.2.

Just curious, has anyone tried 1.2 with a large data set, around 1 TB?

I was experimenting with 128 vs. 512 some time ago and I was unable to see
any difference in terms of performance. I'd probably check 1024 too, but we
migrated to 1.2 and heap space was not an issue anymore.


  I changed my index_interval from 128 to 512. Does it
make sense to increase it more than this?
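
For context on what raising index_interval buys: in these versions Cassandra keeps roughly one sampled index entry in memory per index_interval row keys, so the sample count scales as keys / index_interval. A minimal Python sketch follows; the key count is a made-up number, and real heap usage also depends on key size and per-entry overhead.

```python
def index_samples(total_keys, index_interval):
    """Approximate number of in-memory index samples for a given interval."""
    return total_keys // index_interval


# Hypothetical row-key count on one node, for illustration only.
ASSUMED_KEYS = 1_000_000_000

for interval in (128, 512, 1024):
    print(interval, index_samples(ASSUMED_KEYS, interval))
```

Each step up cuts the number of samples proportionally, at the cost of more scanning in the on-disk index per read.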

  Have a look at index_interval.


  The version of Cassandra I am using is 1.0.11, we are migrating to 1.2.X
though. We had tuned bloom filters (0.1) and AFAIK making it lower than
this won't matter.
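
For reference, the textbook relation between a bloom filter's false-positive chance and its size can be sketched in Python as below. This is the standard formula for an optimally configured filter, not necessarily Cassandra's exact sizing logic, and the key count is a hypothetical number.

```python
import math


def bloom_bits_per_key(fp_chance):
    """Bits per key for an optimally sized bloom filter: -ln(p) / (ln 2)^2."""
    return -math.log(fp_chance) / (math.log(2) ** 2)


def bloom_size_mb(num_keys, fp_chance):
    """Approximate total filter size in MB for num_keys keys."""
    return num_keys * bloom_bits_per_key(fp_chance) / 8 / 1024 / 1024


# Hypothetical key count, for illustration only.
ASSUMED_KEYS = 1_000_000_000

for p in (0.01, 0.1):
    print(p, round(bloom_bits_per_key(p), 2), "bits/key,",
          round(bloom_size_mb(ASSUMED_KEYS, p)), "MB")
```

With this formula, 0.1 needs about half the bits per key that 0.01 does.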

  Which Cassandra version are you on? Essentially heap size is a function
of the number of keys/metadata. In Cassandra 1.2 a lot of the metadata,
like filters, was moved off heap.

  Does anyone know roughly what the heap size should be for Cassandra
with 1 TB of data? We started with about 200 GB and now one of the nodes
is already at 1 TB. We were using 8 GB of heap and that served us well
until we reached 700 G where we started seeing failures and nodes

With 1 TB of data the node refuses to come back due to lack of memory.
Needless to say, repairs and compactions take a lot of time. We upped
the heap from 8 GB to 12 GB and suddenly everything started moving rapidly,
both the repair tasks and the compaction tasks. But soon (in about 9-10
hrs) we started seeing the same symptoms as we were seeing with 8 GB.

So my question is: how do I determine the optimal heap size
for around 1 TB of data?

Thanks !