Is it possible you are generating exclusively deletes for this table?


On 5 February 2014 00:10, Robert Wille <rwille@fold3.com> wrote:
I ran my test again, and Flush Writer’s “All time blocked” increased to 2 and then shortly thereafter GC went into its death spiral. I doubled memtable_flush_writers (to 2) and memtable_flush_queue_size (to 8) and tried again.

This time, the table that always sat with Memtable data size = 0 now showed increases in Memtable data size. That was encouraging. It never flushed, which isn’t too surprising, because that table has relatively few rows and they are pretty wide. However, on the fourth table to clean, Flush Writer’s “All time blocked” went to 1, and then there were no more completed events, and about 10 minutes later GC went into its death spiral. I assume that each time Flush Writer completes an event, that means a table was flushed. Is that right? Also, I got two dropped mutation messages at the same time that Flush Writer’s All time blocked incremented.

I then increased the writers and queue size to 3 and 12, respectively, and ran my test again. This time All time blocked remained at 0, but I still suffered death by GC.

I would almost think that this is caused by high load on the server, but I’ve never seen CPU utilization go above about two of my eight available cores. If high load triggers this problem, then that is very disconcerting. That means that a CPU spike could permanently cripple a node. Okay, not permanently, but until a manual flush occurs.

If anyone has any further thoughts, I’d love to hear them. I’m quite at the end of my rope.

Thanks in advance

Robert

From: Nate McCall <nate@thelastpickle.com>
Reply-To: <user@cassandra.apache.org>
Date: Saturday, February 1, 2014 at 9:25 AM
To: Cassandra Users <user@cassandra.apache.org>
Subject: Re: Lots of deletions results in death by GC

What's the output of 'nodetool tpstats' while this is happening? Specifically is Flush Writer "All time blocked" increasing? If so, play around with turning up memtable_flush_writers and memtable_flush_queue_size and see if that helps.


On Sat, Feb 1, 2014 at 9:03 AM, Robert Wille <rwille@fold3.com> wrote:
A few days ago I posted about an issue I’m having where GC takes a long time (20-30 seconds), and it happens repeatedly and basically no work gets done. I’ve done further investigation, and I now believe that I know the cause. If I do a lot of deletes, it creates memory pressure until the memtables are flushed, but Cassandra doesn’t flush them. If I manually flush, then life is good again (although that takes a very long time because of the GC issue). If I just leave the flushing to Cassandra, then I end up with death by GC. I believe that when the memtables are full of tombstones, Cassandra doesn’t realize how much memory the memtables are actually taking up, and so it doesn’t proactively flush them in order to free up heap.

As I was deleting records out of one of my tables, I was watching it via nodetool cfstats, and I found a very curious thing:

                Memtable cell count: 1285
                Memtable data size, bytes: 0
                Memtable switch count: 56

As the deletion process was chugging away, the memtable cell count increased, as expected, but the data size stayed at 0. No flushing occurred. 

Here’s the schema for this table:

CREATE TABLE bdn_index_pub (
    tshard VARCHAR,
    pord INT,
    ord INT,
    hpath VARCHAR,
    page BIGINT,
    PRIMARY KEY (tshard, pord)
) WITH gc_grace_seconds = 0 AND compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };


I have a few tables that I run this cleaning process on, and not all of them exhibit this behavior. One of them reported an increasing number of bytes, as expected, and it also flushed as expected. Here’s the schema for that table:


CREATE TABLE bdn_index_child (
    ptshard VARCHAR,
    ord INT,
    hpath VARCHAR,
    PRIMARY KEY (ptshard, ord)
) WITH gc_grace_seconds = 0 AND compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };


In both cases, I’m deleting the entire record (i.e. specifying just the first component of the primary key in the delete statement). Most records in bdn_index_pub have 10,000 rows. bdn_index_child records usually have just a handful of rows, but a few can have up to 10,000.
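
To be concrete, the deletes look something like this (the shard value is just a made-up example):

-- a partition-level delete: one statement drops every row under that key
DELETE FROM bdn_index_pub WHERE tshard = 'shard_0042';
DELETE FROM bdn_index_child WHERE ptshard = 'shard_0042';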

A further mystery: 1285 tombstones in the bdn_index_pub memtable doesn’t seem like nearly enough to create a memory problem. Perhaps there are other flaws in the memory metering, or perhaps there is some other issue that causes Cassandra to mismanage the heap when there are a lot of deletes. One other thought I had is that I page through these tables and clean them out as I go. Perhaps there is some interaction between the paging and the deleting that causes the GC problems, and I should instead build a list of keys to delete and then delete them after I’ve finished reading the entire table.
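
If I go that route, it would look something like this (just a sketch; the paging and the key list live in my application code):

-- pass 1: page through the table as I do now, but only remember which partitions to remove
SELECT * FROM bdn_index_pub;

-- pass 2: after the read pass has finished, issue the partition deletes
DELETE FROM bdn_index_pub WHERE tshard = ?;   -- bound to each remembered tshard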

I reduced memtable_total_space_in_mb from the default (probably 2.7 GB) to 1 GB, in hopes that it would force Cassandra to flush tables before I ran into death by GC, but it didn’t seem to help.

I’m using Cassandra 2.0.4.

Any insights would be greatly appreciated. I can’t be the only one that has periodic delete-heavy workloads. Hopefully someone else has run into this and can give advice.

Thanks

Robert



--
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com