cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan Ellis (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-1608) Redesigned Compaction
Date Tue, 10 May 2011 18:47:47 GMT


Jonathan Ellis commented on CASSANDRA-1608:

Here is the approach used by LevelDB (

"The set of sorted tables are organized into a sequence of levels. The sorted table generated
from a [flush] is placed in a special young level (also called level-0). When the number of
young files exceeds a certain threshold (currently four), all of the young files are merged
together with all of the overlapping level-1 files to produce a sequence of new level-1 files
(we create a new level-1 file for every 2MB of data.)

"Files in the young level may contain overlapping keys. However files in other levels have
distinct non-overlapping key ranges. Consider level number L where L >= 1. When the combined
size of files in level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2, ...),
one file in level-L, and all of the overlapping files in level-(L+1) are merged to form a
set of new files for level-(L+1). These merges have the effect of gradually migrating new
updates from the young level to the largest level using only bulk reads and writes (i.e.,
minimizing expensive seeks)

"When the size of level L exceeds its limit, we compact it in a background thread. The compaction
picks a file from level L and all overlapping files from the next level L+1. Note that if
a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is
used as an input to the compaction and will be discarded after the compaction. Aside: because
level-0 is special (files in it may overlap each other), we treat compactions from level-0
to level-1 specially: a level-0 compaction may pick more than one level-0 file in case some
of these files overlap each other.

"A compaction merges the contents of the picked files to produce a sequence of level-(L+1)
files. We switch to producing a new level-(L+1) file after the current output file has reached
the target file size (2MB). We also switch to a new output file when the key range of the
current output file has grown enough to overlap more then ten level-(L+2) files. This last
rule ensures that a later compaction of a level-(L+1) file will not pick up too much data
from level-(L+2).

"Compactions for a particular level rotate through the key space. In more detail, for each
level L, we remember the ending key of the last compaction at level L. The next compaction
for level L will pick the first file that starts after this key (wrapping around to the beginning
of the key space if there is no such file).

"Level-0 compactions will read up to four 1MB files from level-0, and at worst all the level-1
files (10MB). I.e., we will read 14MB and write 14MB.

"Other than the special level-0 compactions, we will pick one 2MB file from level L. In the
worst case, this will overlap ~ 12 files from level L+1 (10 because level-(L+1) is ten times
the size of level-L, and another two at the boundaries since the file ranges at level-L will
usually not be aligned with the file ranges at level-L+1). The compaction will therefore read
26MB and write 26MB. Assuming a disk IO rate of 100MB/s (ballpark range for modern drives),
the worst compaction cost will be approximately 0.5 second.

"If we throttle the background writing to something small, say 10% of the full 100MB/s speed,
a compaction may take up to 5 seconds. If the user is writing at 10MB/s, we might build up
lots of level-0 files (~50 to hold the 5*10MB). This may signficantly increase the cost of
reads due to the overhead of merging more files together on every read.

Then there is some discussion on possible solutions to this problem.

> Redesigned Compaction
> ---------------------
>                 Key: CASSANDRA-1608
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chris Goffinet
> After seeing the I/O issues in CASSANDRA-1470, I've been doing some more thinking on
this subject that I wanted to lay out.
> I propose we redo the concept of how compaction works in Cassandra. At the moment, compaction
is kicked off based on a write access pattern, not read access pattern. In most cases, you
want the opposite. You want to be able to track how well each SSTable is performing in the
system. If we were to keep statistics in-memory of each SSTable, prioritize them based on
most accessed, and bloom filter hit/miss ratios, we could intelligently group sstables that
are being read most often and schedule them for compaction. We could also schedule lower priority
maintenance on SSTable's not often accessed.
> I also propose we limit the size of each SSTable to a fix sized, that gives us the ability
to  better utilize our bloom filters in a predictable manner. At the moment after a certain
size, the bloom filters become less reliable. This would also allow us to group data most
accessed. Currently the size of an SSTable can grow to a point where large portions of the
data might not actually be accessed as often.

This message is automatically generated by JIRA.
For more information on JIRA, see:

View raw message