cassandra-user mailing list archives

From Josh Dzielak
Subject Re: Notes and questions from performing a large delete
Date Sat, 07 Dec 2013 19:58:19 GMT
Thanks Nate. I hadn't noticed that and it definitely explains it.

It'd be nice to see that called out much more clearly. As we found out, the implications can
be severe!
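
For anyone finding this thread later, here is a minimal sketch of the "restate the whole clause" habit Nate describes below. It is only an illustration: `alter_compaction` and `cql_literal` are hypothetical helpers (not part of any Cassandra tool or driver), the `current` map is assumed to be copied by hand from the existing schema output, and `tombstone_threshold` is just an example of a sub-option someone might want to change. The idea is to merge the change into the full existing map so nothing silently falls back to a default.

```python
def cql_literal(value):
    """Render a Python value for a CQL map: strings quoted, numbers bare."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def alter_compaction(table, current, changes):
    """Build an ALTER TABLE that restates every compaction sub-option.

    `current` is the compaction map copied from the existing schema;
    `changes` holds only the sub-options being modified.
    """
    merged = dict(current)   # start from everything already set
    merged.update(changes)   # then apply the intended change on top
    body = ", ".join(
        "{}: {}".format(cql_literal(k), cql_literal(v)) for k, v in merged.items()
    )
    return "ALTER TABLE {} WITH compaction = {{{}}};".format(table, body)


# Assumed existing settings for the 'events' table from this thread.
current = {"class": "LeveledCompactionStrategy", "sstable_size_in_mb": 256}

# 'tombstone_threshold' is only an illustrative sub-option to change.
statement = alter_compaction("events", current, {"tombstone_threshold": 0.1})
print(statement)
```

The printed statement carries `sstable_size_in_mb` along with the new option, which is the part we missed. Review the generated statement before running it in cqlsh.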


On Thursday, December 5, 2013 at 11:30 AM, Nate McCall wrote:

> Per the 256mb to 5mb change, check the very last section of this page:
> "Changing any compaction or compression option erases all previous compaction or compression
> In other words, you have to include the whole 'WITH' clause each time - in the future just grab the output from 'show schema' and add/modify as needed.
> I did not know this either until it happened to me as well - could probably stand to be a little bit more front-and-center, IMO.
> On Wed, Dec 4, 2013 at 2:59 PM, Josh Dzielak wrote:
> > We recently had a little Cassandra party I wanted to share and see if anyone has notes to compare. Or can tell us what we did wrong or what we could do better. :) Apologies in advance for the length of the narrative here.
> > 
> > Task at hand: Delete about 50% of the rows in a large column family (~8TB) to reclaim some disk. These rows are used only for intermediate storage.
> > 
> > Sequence of events: 
> > 
> > - Issue the actual deletes. This, obviously, was super-fast.
> > - Nothing happens yet, which makes sense. New tombstones are not immediately compacted b/c of gc_grace_seconds.
> > - Adjust gc_grace_seconds down to 60 from 86400 using ALTER TABLE in CQL.
> > 
> > - Every node started working very hard. We saw disk space start to free up. It was
> > - Eventually the compactions finished and we had gotten a ton of disk back. 
> > - However, our SSTables were now 5Mb, not 256Mb as they had always been :(
> > - We inspected the schema in CQL/Opscenter etc and sure enough sstable_size_in_mb had changed to 5Mb for this CF. Previously all CFs were set at 256Mb, and all other CFs still
> > 
> > - At 5Mb we had a huge number of SSTables. Our next goal was to get these tables back to 256Mb.
> > - First step was to update the schema back to 256Mb.
> > - Figuring out how to do this in CQL was tricky, because CQL has gone through a lot of changes recently and getting the docs for your version is hard. Eventually we figured it out - ALTER TABLE events WITH compaction={'class':'LeveledCompactionStrategy','sstable_size_in_mb':256};
> > - Out of our 12 nodes, 9 acknowledged the update. The others showed the old schema
> > - The remaining 3 would not. There was no extra load on the systems, operational status was very clean. All nodes could see each other.
> > - For each of the remaining 3 we tried to update the schema through a local cqlsh session. The same ALTER TABLE would just hang forever.
> > - We restarted Cassandra on each of the 3 nodes, then did the ALTER TABLE again. It worked this time. We finally had schema agreement.
> > 
> > - Starting with just 1 node, we kicked off upgradesstables, hoping it would rebuild the 5Mb tables to 256Mb tables.
> > - Nothing happened. This was (afaik) because the sstable size change doesn't represent a new version of schema for the sstables. So existing tables are ignored.
> > - We discovered the "-a" option for upgradesstables, which tells it to skip the schema check and just do all the tables anyway.
> > - We ran upgradesstables -a and things started happening. After a few hours the pending compactions finished.
> > - Sadly, this node was now using 3x the disk it previously had. Some sstables were now 256Mb, but not all. There were tens of thousands of ~20Mb tables.
> > - A direct comparison to other nodes owning the same % of the ring showed both the same number of sstables and the same ratio of 256Mb+ tables to small tables. However, on a 'normal' node the small tables were all 5-6Mb and on the fat, upgraded node, all the tables were 20Mb+. This was why the fat node was taking up 3x disk overall.
> > - I tried to see what was in those 20Mb files relative to the 5Mb ones but sstable2json failed against our authenticated keyspace. I filed a bug (

> > - Had little choice here. We shut down the fat node, did a manual delete of sstables, brought it back up and did a repair. It came back to the right size.
> > 
> > TL;DR / Our big questions are: 
> > How could the schema have spontaneously changed from 256Mb sstable_size_in_mb to
> > How could schema propagation have failed such that only 9 of 12 nodes got the change even when the cluster was healthy? Why did updating schema locally hang until restart?
> > What could have happened inside of upgradesstables that created the node with the same ring % but 3x disk load?
> > 
> > We're on Cassandra 1.2.8, Java 6, Ubuntu 12. Running on SSDs, 12-node cluster across 2 DCs. No compression, leveled compaction. Happy to provide more details. Thanks in advance for any insights into what happened or any best practices we missed during this episode.
> > 
> > Best,
> > Josh
> > 
> -- 
> -----------------
> Nate McCall
> Austin, TX
> @zznate
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
