A few days ago I posted about an issue where GC pauses take a long time (20-30 seconds) and happen repeatedly, so basically no work gets done. I’ve investigated further, and I now believe I know the cause. If I do a lot of deletes, memory pressure builds until the memtables are flushed, but Cassandra doesn’t flush them. If I flush manually, life is good again (although the flush takes a very long time because of the GC issue). If I leave the flushing to Cassandra, I end up with death by GC. I believe that when the memtables are full of tombstones, Cassandra doesn’t realize how much memory the memtables are actually taking up, so it doesn’t proactively flush them to free up heap.

As I was deleting records out of one of my tables, I was watching it via nodetool cfstats, and I found a very curious thing:

                Memtable cell count: 1285
                Memtable data size, bytes: 0
                Memtable switch count: 56

As the deletion process was chugging away, the memtable cell count increased, as expected, but the data size stayed at 0. No flushing occurred. 
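
To watch for this symptom automatically rather than by eye, the three cfstats lines above can be parsed and checked. The following is a minimal Python sketch, assuming the text passed in is the cfstats section for a single table and that the lines look exactly like the ones quoted above; the function names `parse_memtable_stats` and `looks_unmetered` are hypothetical, not part of any Cassandra tooling.

```python
import re

# Matches lines such as "Memtable cell count: 1285" as shown in the
# cfstats excerpt above (leading whitespace is ignored).
_STAT_LINE = re.compile(
    r"^\s*Memtable (cell count|data size, bytes|switch count):\s*(\d+)\s*$"
)


def parse_memtable_stats(cfstats_text):
    """Return the memtable counters found in one table's cfstats section."""
    stats = {}
    for line in cfstats_text.splitlines():
        match = _STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def looks_unmetered(stats):
    """True when cells are accumulating but the reported size is still 0."""
    return stats.get("cell count", 0) > 0 and stats.get("data size, bytes", 0) == 0
```

A table for which `looks_unmetered` stays true while deletes are running would match the behavior described here.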

Here’s the schema for this table:

CREATE TABLE bdn_index_pub (
    tshard VARCHAR,
    pord INT,
    ord INT,
    hpath VARCHAR,
    page BIGINT,
    PRIMARY KEY (tshard, pord)
) WITH gc_grace_seconds = 0 AND compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

I have a few tables that I run this cleaning process on, and not all of them exhibit this behavior. One of them reported an increasing number of bytes, as expected, and it also flushed as expected. Here’s the schema for that table:

CREATE TABLE bdn_index_child (
    ptshard VARCHAR,
    ord INT,
    hpath VARCHAR,
    PRIMARY KEY (ptshard, ord)
) WITH gc_grace_seconds = 0 AND compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

In both cases, I’m deleting the entire record (i.e. specifying just the first component of the primary key in the delete statement). Most records in bdn_index_pub have 10,000 rows per record. bdn_index_child usually has just a handful of rows per record, but a few records can have up to 10,000.

A further mystery: 1285 tombstones in the bdn_index_pub memtable don’t seem like nearly enough to create a memory problem. Perhaps there are other flaws in the memory metering. Or perhaps some other issue causes Cassandra to mismanage the heap when there are a lot of deletes. One other thought: I page through these tables and clean them out as I go. Perhaps some interaction between the paging and the deleting causes the GC problems, and I should instead build a list of keys to delete and delete them only after I’ve finished reading the entire table.
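
The read-first, delete-later idea can be sketched as follows. This is a minimal Python illustration, not the actual cleaning process: `two_phase_clean`, `should_delete`, and `delete_partition` are hypothetical names, `rows` stands in for whatever paged read the client performs (yielding a partition key and a row), and `delete_partition` stands in for issuing a delete on the first primary key component only.

```python
def two_phase_clean(rows, should_delete, delete_partition):
    """Read the whole table first, then delete, so the two never interleave.

    rows             -- iterable of (partition_key, row) pairs from a paged read
    should_delete    -- hypothetical predicate: should_delete(key, row) -> bool
    delete_partition -- hypothetical callback that deletes one whole record
    """
    doomed = []   # keys to delete, in the order they were first seen
    seen = set()  # avoids queueing the same partition key twice

    # Phase 1: page through everything, only recording keys.
    for key, row in rows:
        if key not in seen and should_delete(key, row):
            seen.add(key)
            doomed.append(key)

    # Phase 2: issue the deletes once the read has fully completed.
    for key in doomed:
        delete_partition(key)

    return doomed
```

The trade-off is that the list of keys has to fit in client memory, which is fine for a few million short keys but worth checking for larger tables.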

I reduced memtable_total_space_in_mb from the default (probably 2.7 GB) to 1 GB, in hopes that it would force Cassandra to flush tables before I ran into death by GC, but it didn’t seem to help.

I’m using Cassandra 2.0.4.

Any insights would be greatly appreciated. I can’t be the only one who has periodic delete-heavy workloads. Hopefully someone else has run into this and can give advice.