cassandra-commits mailing list archives

From "Benjamin Coverston (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-1608) Redesigned Compaction
Date Wed, 27 Jul 2011 03:27:09 GMT


Benjamin Coverston commented on CASSANDRA-1608:

bq. Not a deal breaker for me – it's not hard to get old-style compactions to back up under
sustained writes, either. Given a choice between "block writes until compactions catch up"
or "let them back up and let the operater deal with it how he will," I'll take the latter.

Exposing number of SSTables in L0 as a JMX property probably isn't a bad idea.
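
Something along these lines would do it. Purely a sketch: the MBean name and the way the count
is fed from the manifest are illustrative, not what's in the patch.

{code}
import java.lang.management.ManagementFactory;
import java.util.function.IntSupplier;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Two files in practice (JMX requires the MBean interface to be public and named <Class>MBean);
// shown together here for brevity.
public interface Level0MetricsMBean
{
    int getLevel0SSTableCount();
}

public class Level0Metrics implements Level0MetricsMBean
{
    // e.g. () -> manifest.getLevelSize(0); the manifest accessor is hypothetical
    private final IntSupplier level0Count;

    public Level0Metrics(IntSupplier level0Count) throws Exception
    {
        this.level0Count = level0Count;
        // register on the platform MBean server so the L0 count shows up in JConsole / JMX tooling
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        mbs.registerMBean(this, new ObjectName("org.apache.cassandra.db:type=Level0Metrics"));
    }

    @Override
    public int getLevel0SSTableCount()
    {
        return level0Count.getAsInt();
    }
}
{code}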

bq. Is it even worth keeping bloom filters around with such a drastic reduction in worst-case
number of sstables to check (for read path too)?

I think they are absolutely worth keeping around for unleveled sstables, but for leveled sstables
the value is certainly questionable. Perhaps an LRU cache with an upper bound on the number of
bloom filters we keep in memory would be wise. Is it possible that we could move these off-heap?

bq. I'd like to have a better understanding of what the tradeoff is between making these settings
larger/smaller. Can we make these one-size-fits-all?

Some pros and cons here. The biggest con is that for a 64MB flushed sstable, leveling that file
with a 25MB leveled size requires running a compaction over approximately 314MB of data
(25MB * 10 + 64MB) to get the data leveled into L1. If we choose 50MB for our leveled size the
math is the same, but we end up compacting 564MB of data. Taking into account level-based
scoring (to choose the next compaction candidates), these settings become somewhat dynamic,
and the interplay between flush size and sstable size is anything but subtle. A small leveled
size in combination with a large flushing memtable means that each time you merge a flushed
SSTable into L1 you could end up with many cycles of cascading compactions into L2, and
potentially into L3 and higher, until the scores for L1, L2, and L3 normalize into a range
that again triggers compactions from L0 to L1.
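
To make the arithmetic above concrete (assuming a fan-out of 10 between levels, so one flushed
sstable overlaps roughly an entire level's worth of L1 data):

{code}
// Back-of-the-envelope cost of leveling one flushed sstable into L1.
public class CompactionWork
{
    static long l0MergeCostMB(long flushedMB, long leveledMB, int fanout)
    {
        // the flushed file plus all of the L1 sstables it overlaps get rewritten
        return leveledMB * fanout + flushedMB;
    }

    public static void main(String[] args)
    {
        System.out.println(l0MergeCostMB(64, 25, 10)); // 314 MB, the 25MB case above
        System.out.println(l0MergeCostMB(64, 50, 10)); // 564 MB, the 50MB case above
    }
}
{code}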

I wanted to keep the time for each compaction to something < 10 seconds so I chose an sstable
size in the range of 5-10 MB and that was effective.
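
The same arithmetic roughly explains that choice. The throughput figure below is an assumption
for illustration, not a measurement from this ticket.

{code}
// Rough sanity check of the "< 10 seconds per compaction" target.
public class CompactionTime
{
    public static void main(String[] args)
    {
        double assumedThroughputMBps = 15.0; // assumed effective compaction throughput
        int fanout = 10;
        for (int sstableMB : new int[]{ 5, 10 })
        {
            // merging one sstable into the next level rewrites about (fanout + 1) * size
            double workMB = (fanout + 1) * sstableMB;
            System.out.printf("%d MB sstable: ~%.0f MB per compaction, ~%.1f s%n",
                              sstableMB, workMB, workMB / assumedThroughputMBps);
        }
    }
}
{code}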

I like the idea of having a one-size-fits-all setting for this, but whatever I choose I think
that compaction is going to force me to revisit it. Right now this setting is part of the schema,
and it's a nested schema setting at that. I'm leaning toward "undocumented-setting" right
now with a reasonable default.

> Redesigned Compaction
> ---------------------
>                 Key: CASSANDRA-1608
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chris Goffinet
>            Assignee: Benjamin Coverston
>         Attachments: 1608-v2.txt, 1608-v8.txt, 1609-v10.txt
> After seeing the I/O issues in CASSANDRA-1470, I've been doing some more thinking on
> this subject that I wanted to lay out.
> I propose we redo the concept of how compaction works in Cassandra. At the moment, compaction
> is kicked off based on a write access pattern, not a read access pattern. In most cases, you
> want the opposite. You want to be able to track how well each SSTable is performing in the
> system. If we were to keep statistics in memory for each SSTable, and prioritize them based on
> how often they are accessed and their bloom filter hit/miss ratios, we could intelligently
> group sstables that are being read most often and schedule them for compaction. We could also
> schedule lower-priority maintenance on SSTables that are not often accessed.
> I also propose we limit the size of each SSTable to a fixed size; that gives us the ability
> to better utilize our bloom filters in a predictable manner. At the moment, after a certain
> size, the bloom filters become less reliable. This would also allow us to group the most
> accessed data. Currently the size of an SSTable can grow to a point where large portions of
> the data might not actually be accessed as often.
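
As an aside, the per-sstable statistics the description talks about could be as simple as a few
counters. This is only an illustration of the idea, not anything from the attached patches; the
names and the scoring heuristic are made up.

{code}
import java.util.concurrent.atomic.AtomicLong;

// Per-sstable, in-memory read statistics: read count plus bloom filter hit/miss counts,
// usable to rank sstables as compaction candidates.
public class SSTableReadStats
{
    private final AtomicLong reads = new AtomicLong();
    private final AtomicLong bloomHits = new AtomicLong();
    private final AtomicLong bloomMisses = new AtomicLong();

    public void recordRead()      { reads.incrementAndGet(); }
    public void recordBloomHit()  { bloomHits.incrementAndGet(); }
    public void recordBloomMiss() { bloomMisses.incrementAndGet(); }

    public double bloomMissRatio()
    {
        long hits = bloomHits.get();
        long misses = bloomMisses.get();
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) misses / total;
    }

    // One possible priority: hot sstables whose filters miss often gain the most from compaction.
    public double compactionScore()
    {
        return reads.get() * (1.0 + bloomMissRatio());
    }
}
{code}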

