I have a cluster of 5 machines running a Cassandra datastore, and I bulk-load data into it using the Java Thrift API. The first ~250 GB load fine; then one of the nodes starts throwing OutOfMemory exceptions. I'm not using any row or index caches, and since I only have 5 CFs and about 2.5 GB of RAM allocated to the JVM (-Xmx2500M), in theory this shouldn't happen. All inserts are done with consistency level ALL.
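For scale, here is a rough back-of-envelope on how memtables alone can eat a 2.5 GB heap. The per-CF flush threshold and the JVM overhead factor below are assumed illustration values, not numbers from this cluster:

```python
# Back-of-envelope heap budget for memtables under bulk load.
# All threshold/overhead numbers are assumptions for illustration.

heap_mb = 2500                 # -Xmx2500M
column_families = 5
memtable_throughput_mb = 64    # assumed per-CF flush threshold (serialized size)
overhead_factor = 3            # assumed JVM object overhead vs. serialized size

# Worst case: every memtable sits just under its threshold, plus one
# queued (not-yet-flushed) generation per CF if the flush writers fall behind.
live = column_families * memtable_throughput_mb * overhead_factor
queued = live                  # one backed-up generation per CF
total = live + queued

print(f"memtable heap alone: ~{total} MB of {heap_mb} MB")
```

Under those assumptions, memtables plus a single backed-up flush generation already claim the bulk of the heap, before compaction buffers and Thrift request handling are counted.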

I hope that rules out the usual rookie mistakes that lead to OOMs. I've started troubleshooting with JMX, but it's hard to catch the JVM at the right moment, because the node runs fine for several hours before this happens.

One thing comes to mind; maybe one of the experts could confirm or reject this idea for me: is it possible that when one machine slows down a bit (for example because a big compaction is running), memtables aren't flushed to disk as fast as they build up under the continuing bulk import? That would cause a downward spiral: the node gets slower and slower on disk I/O, but since more and more data keeps arriving over Thrift, heap usage grows until it finally OOMs.
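If that theory is right, these are the knobs that bound how far flushing can fall behind (cassandra.yaml names from the 0.7-era config; exact names vary by version, and the values below are examples, not recommendations):

```yaml
# Emergency valve: forcibly flush the largest memtables once the heap
# reaches this fraction of -Xmx.
flush_largest_memtables_at: 0.75

# Number of memtable flush threads; raising this only helps if the
# disks are not already saturated.
memtable_flush_writers: 1

# Upper bound on memtables queued for flushing; once the queue is full,
# writes block instead of piling further data onto the heap.
memtable_flush_queue_size: 4
```

Watching FlushWriter pending/blocked counts (e.g. via `nodetool tpstats`) in the hours before the crash would show whether the flush queue is actually backing up.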

I'm using the "periodic" commit log sync; maybe this could also create a situation where the commit log writer is too slow to keep up with the data intake, resulting in ever-growing memory usage?
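For reference, the two sync modes differ in whether they apply backpressure (cassandra.yaml syntax from the 0.7-era config; verify the names against your version):

```yaml
# Periodic mode: fsync the commit log every N ms. Writes are acked
# before the sync happens, so a slow disk does not throttle intake.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Batch mode (alternative): writes are not acked until the log segment
# is synced, which throttles clients at the cost of write latency.
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 50
```

So periodic sync itself shouldn't accumulate unbounded memory, but it also does nothing to slow clients down when the node is struggling, which fits the downward-spiral idea above.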

Maybe these thoughts are nonsense. Let me know if so... ;-)