cassandra-user mailing list archives

From Robert Wille <>
Subject Re: Lots of deletions results in death by GC
Date Wed, 05 Feb 2014 00:10:48 GMT
I ran my test again, and Flush Writer's "All time blocked" increased to 2
and then shortly thereafter GC went into its death spiral. I doubled
memtable_flush_writers (to 2) and memtable_flush_queue_size (to 8) and tried
again.

This time, the table that always sat with Memtable data size = 0 now showed
increases in Memtable data size. That was encouraging. It never flushed,
which isn't too surprising, because that table has relatively few rows and
they are pretty wide. However, on the fourth table to clean, Flush Writer's
"All time blocked" went to 1, and then there were no more completed events,
and about 10 minutes later GC went into its death spiral. I assume that each
time Flush Writer completes an event, that means a table was flushed. Is
that right? Also, I got two dropped mutation messages at the same time that
Flush Writer's "All time blocked" incremented.
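
For anyone who wants to watch for the same symptom automatically, here is a minimal Python sketch that pulls the three memtable counters out of one table's section of `nodetool cfstats` output and flags the "cells present but zero bytes" condition. The field labels are the ones shown in the quoted cfstats excerpt further down; the function names `memtable_stats` and `looks_unmetered` are made up for illustration, and other Cassandra versions may label these fields differently.

```python
import re

def memtable_stats(cfstats_text):
    """Extract memtable counters from one table's cfstats section.

    Assumes the field labels seen in the quoted 2.0.4 output.
    """
    fields = {
        "cell_count": r"Memtable cell count:\s*(\d+)",
        "data_size_bytes": r"Memtable data size, bytes:\s*(\d+)",
        "switch_count": r"Memtable switch count:\s*(\d+)",
    }
    stats = {}
    for name, pattern in fields.items():
        match = re.search(pattern, cfstats_text)
        if match:
            stats[name] = int(match.group(1))
    return stats

def looks_unmetered(stats):
    """True when the memtable holds cells but reports a data size of 0."""
    return stats.get("cell_count", 0) > 0 and stats.get("data_size_bytes") == 0

# The three lines reported for bdn_index_pub in the quoted message.
sample = """
                Memtable cell count: 1285
                Memtable data size, bytes: 0
                Memtable switch count: 56
"""
stats = memtable_stats(sample)
print(stats, looks_unmetered(stats))
```

This only detects the symptom; it says nothing about why the data size stays at 0.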

I then increased the writers and queue size to 3 and 12, respectively, and
ran my test again. This time All time blocked remained at 0, but I still
suffered death by GC.

I would almost think that this is caused by high load on the server, but
I've never seen CPU utilization go above about two of my eight available
cores. If high load triggers this problem, then that is very disconcerting.
That means that a CPU spike could permanently cripple a node. Okay, not
permanently, but until a manual flush occurs.

If anyone has any further thoughts, I'd love to hear them. I'm quite at the
end of my rope.

Thanks in advance


From:  Nate McCall <>
Reply-To:  <>
Date:  Saturday, February 1, 2014 at 9:25 AM
To:  Cassandra Users <>
Subject:  Re: Lots of deletions results in death by GC

What's the output of 'nodetool tpstats' while this is happening?
Specifically is Flush Writer "All time blocked" increasing? If so, play
around with turning up memtable_flush_writers and memtable_flush_queue_size
and see if that helps.
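
One way to check whether "All time blocked" is increasing is to capture `nodetool tpstats` twice and compare the flush pool's counter. The Python sketch below assumes the usual column layout (pool name followed by Active, Pending, Completed, Blocked, All time blocked) and that the pool is printed as `FlushWriter`; both are assumptions about the output format, the helper names are hypothetical, and the numbers in the samples are invented for illustration.

```python
def parse_tpstats(text):
    """Parse thread pool rows from `nodetool tpstats` text.

    Assumes each pool row ends with five integer columns:
    Active, Pending, Completed, Blocked, All time blocked.
    """
    pools = {}
    for line in text.splitlines():
        tokens = line.split()
        # Skip headers, blank lines, and the dropped-message section.
        if len(tokens) < 6 or not all(t.isdigit() for t in tokens[-5:]):
            continue
        # Join name tokens so "Flush Writer" and "FlushWriter" both work.
        name = "".join(tokens[:-5])
        active, pending, completed, blocked, all_time = map(int, tokens[-5:])
        pools[name] = {
            "active": active,
            "pending": pending,
            "completed": completed,
            "blocked": blocked,
            "all_time_blocked": all_time,
        }
    return pools

def blocked_delta(before_text, after_text, pool="FlushWriter"):
    """Change in 'All time blocked' for one pool between two snapshots."""
    before = parse_tpstats(before_text)[pool]["all_time_blocked"]
    after = parse_tpstats(after_text)[pool]["all_time_blocked"]
    return after - before

# Invented sample snapshots, not real output from the cluster discussed here.
header = "Pool Name    Active   Pending   Completed   Blocked  All time blocked\n"
snapshot_1 = header + "FlushWriter       0         0          56         0                 0\n"
snapshot_2 = header + "FlushWriter       1         4          57         1                 2\n"
print(blocked_delta(snapshot_1, snapshot_2))
```

In practice the two snapshots would come from running `nodetool tpstats` a few minutes apart while the delete workload is active.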

On Sat, Feb 1, 2014 at 9:03 AM, Robert Wille <> wrote:
> A few days ago I posted about an issue I'm having where GC takes a long time
> (20-30 seconds), and it happens repeatedly and basically no work gets done.
> I've done further investigation, and I now believe that I know the cause. If I
> do a lot of deletes, it creates memory pressure until the memtables are
> flushed, but Cassandra doesn't flush them. If I manually flush, then life is
> good again (although that takes a very long time because of the GC issue). If
> I just leave the flushing to Cassandra, then I end up with death by GC. I
> believe that when the memtables are full of tombstones, Cassandra doesn't
> realize how much memory the memtables are actually taking up, and so it
> doesn't proactively flush them in order to free up heap.
> As I was deleting records out of one of my tables, I was watching it via
> nodetool cfstats, and I found a very curious thing:
>                 Memtable cell count: 1285
>                 Memtable data size, bytes: 0
>                 Memtable switch count: 56
> As the deletion process was chugging away, the memtable cell count increased,
> as expected, but the data size stayed at 0. No flushing occurred.
> Here's the schema for this table:
> CREATE TABLE bdn_index_pub (
> tshard VARCHAR,
> pord INT,
> ord INT,
> hpath VARCHAR,
> page BIGINT,
> PRIMARY KEY (tshard, pord)
> ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
> 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
> I have a few tables that I run this cleaning process on, and not all of them
> exhibit this behavior. One of them reported an increasing number of bytes, as
> expected, and it also flushed as expected. Here's the schema for that table:
> CREATE TABLE bdn_index_child (
> ptshard VARCHAR,
> ord INT,
> hpath VARCHAR,
> PRIMARY KEY (ptshard, ord)
> ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
> 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
> In both cases, I'm deleting the entire record (i.e. specifying just the first
> component of the primary key in the delete statement). Most records in
> bdn_index_pub have 10,000 rows per record. bdn_index_child usually has just a
> handful of rows, but a few records can have up to 10,000.
> Still a further mystery, 1285 tombstones in the bdn_index_pub memtable doesn't
> seem like nearly enough to create a memory problem. Perhaps there are other
> flaws in the memory metering. Or perhaps there is some other issue that causes
> Cassandra to mismanage the heap when there are a lot of deletes. One other
> thought I had is that I page through these tables and clean them out as I go.
> Perhaps there is some interaction between the paging and the deleting that
> causes the GC problems and I should create a list of keys to delete and then
> delete them after I've finished reading the entire table.
> I reduced memtable_total_space_in_mb from the default (probably 2.7 GB) to 1
> GB, in hopes that it would force Cassandra to flush tables before I ran into
> death by GC, but it didn't seem to help.
> I'm using Cassandra 2.0.4.
> Any insights would be greatly appreciated. I can't be the only one that has
> periodic delete-heavy workloads. Hopefully someone else has run into this and
> can give advice.
> Thanks
> Robert
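
The alternative floated in the quoted message, building a list of keys while paging and deleting only after the scan has finished, could be structured roughly as in the Python sketch below. It is driver-agnostic: `pages`, `should_delete`, and `delete_partition` are hypothetical stand-ins for whatever paging query, cleanup rule, and DELETE call the real job uses, and nothing in this thread establishes whether separating the two phases actually avoids the GC problem.

```python
def collect_then_delete(pages, should_delete, delete_partition):
    """Scan everything first, then delete.

    pages: iterable of pages, each an iterable of partition keys (hypothetical).
    should_delete: callable deciding whether a key is to be removed.
    delete_partition: callable that issues the delete for one key.
    Returns the number of partitions deleted.
    """
    doomed = []
    seen = set()
    # Phase 1: read every page and remember which keys to remove.
    for page in pages:
        for key in page:
            if key in seen:
                continue  # the same partition key can appear on several pages
            seen.add(key)
            if should_delete(key):
                doomed.append(key)
    # Phase 2: no deletes are issued until the scan is complete.
    for key in doomed:
        delete_partition(key)
    return len(doomed)

# Toy usage with in-memory data standing in for the table.
deleted = []
count = collect_then_delete(
    pages=[["a", "b"], ["b", "c"]],
    should_delete=lambda key: key != "c",
    delete_partition=deleted.append,
)
print(count, deleted)
```

The trade-off is that the key list has to fit in client memory, which should be manageable when only partition keys are kept rather than whole rows.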

Nate McCall
Austin, TX

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
