cassandra-commits mailing list archives

From: Björn Hegerfors (JIRA)
Subject: [jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
Date: Fri, 03 Oct 2014 16:02:37 GMT


Björn Hegerfors commented on CASSANDRA-6602:

OK, so a writeup on some implementation details of this strategy might be in order. Maybe
there's a better place to put something as long as this.

But first, thanks for the patch Marcus! Everything looks good.

Now, Marcus mentioned being worried about sliding windows as far as SSTable candidate selection
goes. Or at least that's what it sounded like. That was indeed one of my main concerns when
I started implementing DTCS. After all, time is constantly moving, so how do we avoid this
situation where some SSTables compact and then immediately slide over into some "SSTables
older than x" time window and need to compact again. Or maybe before the first compaction,
a few of the oldest of those SSTables managed to slip over already, causing the window for
bigger SSTables to reach more than min_compaction_threshold and triggering a compaction with
mixed size SSTables. Or something else like that.

It can easily get messy, so I aimed for an algorithm with as little of a "sliding windows"
effect as possible. What I came up with is really all summed up by the Target class. A target
represents a time span between Unix epoch* and now. An SSTable is "on target" if its <b>min</b>
timestamp is within a given target's time span.

This is the first decision that I would like some input on. My thinking was that if there's
a big SSTable with timestamps ranging from old to new, picking <b>max</b> timestamp
would've made this data participate in too frequent compactions, because SSTables considered
new compact more often than those considered old. STCS creates these kinds of SSTables; if
you run DTCS from the start, it won't matter which timestamp you choose as the representative
timestamp of an SSTable. So choosing <b>min</b> timestamp was essentially for improved compatibility
with STCS.

How these targets are chosen is the most interesting part. I tend to think targets backwards
in time. The first target is the one covering most recent timestamps. The most naive idea
would be to let the first target be the last hour, the next target the hour before, and so
on for 24 hours. Then target #25 would be the day from 24 hours ago to 48 hours ago. 6 of
those would get us a week back. Then the same thing for months and years, etc. This approach
has a number of issues. First, it needs a concept of a calendar that I would rather not hardcode,
as not everybody will want it that way. Plus, it would compact by widely varying factors like
24, 7 and 4 (but a month is not exactly 4 weeks, except February, sometimes... ugh!). So I
quickly ditched the calendar for a configurable factor, called "base", and an initial target
size that I call "time_unit". You could set base to 3 and timeUnit to an hour, for example,
expressed in microseconds (if microseconds is what your timestamps use; I recommend sticking
with microseconds, which appears to be the CQL standard. Timestamp inconsistencies
mess with DTCS, as I've involuntarily experienced in my tests). Then the calendar-less
version of the above would pick last hour as the initial target and 2 more for the hours before
that. Then before that, 2 3-hour targets, then 2 9-hour targets, etc. I then noticed that
"base" mostly duplicated the role of min_compaction_threshold, so I removed it in favor of
that. Any input on that choice?
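
To make the sizes concrete, here is a small sketch (plain Python, invented names, not the actual DTCS code) of how target spans grow geometrically from time_unit by a factor of base:

```python
# Illustrative sketch (invented names): target span sizes grow geometrically
# from time_unit by a factor of "base". With base=3 and time_unit = 1 hour
# in microseconds, the tiers are 1h, 3h, 9h, 27h, ...

MICROS_PER_HOUR = 60 * 60 * 1_000_000

def target_span_sizes(time_unit, base, tiers):
    """Span size (in the timestamp's unit) for each tier of targets."""
    return [time_unit * base ** i for i in range(tiers)]

sizes = target_span_sizes(MICROS_PER_HOUR, 3, 4)
print([s // MICROS_PER_HOUR for s in sizes])  # [1, 3, 9, 27] hours per tier
```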

The above solution is similar to my final algorithm, but if you noticed my wording, you'll
see that the above suffers from the "sliding windows" issue. It's most probably a bad idea
to use something like "last hour" or "between 18 hours ago and 9 hours ago" as targets, since
those would be moving targets. Some integer arithmetic fit perfectly as a way around this.
Let "now" be the current time (I get it by taking the greatest max timestamp across all SSTables,
any input on that?). Let's keep timeUnit at an hour and base (min_compaction_threshold) at
3. Using integer division, now/timeUnit will yield the same value for an hour. Now we can
use that as our first target. Unless we recently passed an hour threshold, (now - 20 seconds)/timeUnit
will have the same result. We can't simply use the old approach of making 2 single-hour targets
before that, etc., because that would just lead to the same sliding windows, only with considerably
lower resolution (certainly better than before, though). Rather, consider what happens if
you integer divide this initial target with base. Then we get a new target representing the
3 hour time span, perfectly aligned on 3 hour borders, that contains the current hour. The
current hour could be either the 1st, 2nd or 3rd one of those hours (nothing in-between).
You can continue like that, dividing by base again to get the perfectly aligned 9 hour time
span containing the aforementioned 3 hour span in one of three slots.
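
The alignment trick can be sketched like this (hypothetical helper; timestamps are plain hour numbers for readability, i.e. time_unit = 1, and Python's // performs the integer division described above):

```python
# Sketch of the alignment trick (invented names).

def aligned_spans(now, base, tiers):
    """The chain of perfectly aligned spans containing `now`, one per tier.

    Tier i spans base**i hours; integer-dividing by base again coarsens
    the span while keeping it aligned on base**i borders.
    """
    t, spans = now, []
    for i in range(tiers):
        size = base ** i
        spans.append((t * size, (t + 1) * size))  # [start, end) in hours
        t //= base
    return spans

print(aligned_spans(13, 3, 3))  # [(13, 14), (12, 15), (9, 18)]
```

With now = hour 13 and base = 3, the current hour sits inside the aligned 3-hour span [12, 15) and the aligned 9-hour span [9, 18), exactly as described.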

This could be the sequence of targets to use. I chose not to use this, and this is again somewhere
I could use a second opinion. Instead, I let the initial target t0 = now/timeUnit (integer
division is implicit), just like before, but the next target is t1 = (t0/base) - 1. That is,
the perfectly aligned 3 hour time span right before the one that t0 was in. Then t2 = (t1/base)
- 1 and so on. With this change, new SSTables aren't candidates for all targets after the
first one they "hit", as it would have been without this last change.

Now we're almost there. The first odd thing about this sequence of targets is that there are
now gaps between targets. If t0 is not the first hour of a perfectly aligned 3 hour time span
(in integer arithmetic terms: if (t0 % base > 0)), then the next target t1 (a 3 hour span),
will not end right before t0. I thought it made sense to fill in those gaps. So in those cases,
when (tn % base > 0), I did it in the simplest way, by letting the next target t(n+1) have
the same size as the current target, but moved one step earlier. Essentially, t(n+1) = tn - 1.
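
A sketch of the resulting target sequence (invented names; a target is a (size, position) pair covering hours [position*size, (position+1)*size)):

```python
# Sketch of the full target sequence as described. Start at t0 = now with
# size 1 (time_unit = 1 for readability); while the position is not on a
# "base" border, step back one same-size target (the gap filling above);
# on a border, jump to the base-times-bigger target just before it.

def target_sequence(now, base, count):
    size, pos = 1, now
    targets = []
    for _ in range(count):
        targets.append((size, pos))
        if pos % base > 0:
            pos -= 1              # t(n+1) = tn - 1: same size, one step earlier
        else:
            size, pos = size * base, pos // base - 1  # t(n+1) = tn/base - 1
    return targets

print(target_sequence(13, 3, 4))  # [(1, 13), (1, 12), (3, 3), (9, 0)]
```

With base = 3 and now = hour 13, the four targets cover hours [13, 14), [12, 13), [9, 12) and [0, 9): they tile all of [0, 14) with no gaps and no overlap.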

This is exactly the way DTCS is implemented right now. Whenever at least min_compaction_threshold
SSTables hit the same target, they get submitted for compaction. If multiple targets succeed,
the first (i.e. latest) one is picked. In the first patch that I submitted here, there was
one thing that was done differently: there, the initial target was (now/timeUnit) - 1, meaning
that the current hour was not included in any target. I'm no expert at which of these is best,
but the difference is that the approach of that patch let new SSTables queue up until we passed
an hour (or whatever timeUnit is set to) border. Then they would all compact at once. The
current implementation favors compacting and recompacting them as much as possible for the
first hour (but not if they're less than min_compaction_threshold).
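
Candidate selection can be sketched as follows (hypothetical helper): bucket SSTables by the target whose span contains their min timestamp, then compact the first (i.e. latest) target that collected at least min_compaction_threshold of them.

```python
# Sketch of candidate selection. Targets are (size, position) pairs, newest
# first, in hour units; each covers hours [position*size, (position+1)*size).

def choose_compaction(min_timestamps, targets, min_threshold):
    for size, pos in targets:             # newest (latest) target first
        start, end = pos * size, (pos + 1) * size
        bucket = [ts for ts in min_timestamps if start <= ts < end]
        if len(bucket) >= min_threshold:
            return bucket                 # submit these for compaction
    return None                           # no target reached the threshold

targets = [(1, 13), (1, 12), (3, 3), (9, 0)]  # base=3, "now" = hour 13
print(choose_compaction([13, 13, 13, 13, 10, 5], targets, 4))  # [13, 13, 13, 13]
```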

A side effect of this is that if min_compaction_threshold is, say, 4, you might have 1 big
and 2 small SSTables from the last hour; the moment time (more specifically, the "now" variable)
crosses over to a new hour, they will stay uncompacted like that until those SSTables enter
a bigger target. At that point, what you would have expected to be 4 similarly sized SSTables
in that new target would instead be 3 similarly sized ones, 1 a tad smaller, and 2 small ones
(actually the same thing is likely to have happened during other hours too). But after that
compaction happens, everything is like it should be**. I don't believe that it's a big deal.
There are a couple ways around it. The (now/timeUnit) - 1 initial target approach fixes this.
Another way would be to ignore min_compaction_threshold for anything beyond how I use it as
a "base", or keeping base and min_compaction_threshold separate, letting you set min_compaction_threshold
to 2. Does anyone have any ideas about this?

From what I remember right now, the last tweak that I've been thinking about is the following.
Let's use our base=3, timeUnit=hour example from before. If we have (t0 % base == 0), we know
that we're at a 3 hour border. Right now that means that the next target t1 = (t0/base) -
1. But what if we're actually also at a 9 hour border? Then we could make the next target
9 times bigger and put it as (t0/(base^2)) - 1 instead. But we might even be at a 27 hour
border. I hope you see where this is going? Using this knowledge, we could more aggressively
compact SSTables up to bigger sizes at an earlier point. That tweak might be worth considering.
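
A sketch of that tweak (hypothetical helper): when the current position sits on a base**k border, jump straight to the base**k-sized target right before it, rather than only one tier up.

```python
# Sketch of the "more aggressive" next-target rule, in hour units.

def aggressive_next(pos, base):
    """(size_multiplier, next_position) for the biggest border pos sits on."""
    if pos % base > 0:
        return 1, pos - 1         # not on a border: plain gap filling
    k = 0
    while pos > 0 and pos % base == 0:
        pos //= base
        k += 1
    return base ** k, pos - 1     # jump base**k tiers at once

print(aggressive_next(9, 3))   # hour 9 is a 9-hour border  -> (9, 0)
print(aggressive_next(12, 3))  # a 3- but not 9-hour border -> (3, 3)
```

At hour 9 this would immediately offer the whole 9-hour span [0, 9) as one target, instead of first working through 3-hour targets.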

I'll end this detailed explanation with a simple visualization that might help in getting
some intuition about how the targets "move". Here, base is 4 and we let time go from time
0 to 20. On each line, one timeUnit has passed, what you see is a list of targets and which
timestamps they cover (after division by timeUnit).

Sliding windows approach:

Current implementation:

More aggressive tweak:
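
The ASCII listings themselves did not survive the archive, but the "current implementation" view can be regenerated with a small sketch (invented helper; each line shows the targets as [start, end) ranges in time units):

```python
def targets_at(now, base):
    """All targets back to time 0, as [start, end) ranges, newest first."""
    size, pos, out = 1, now, []
    while pos >= 0:
        out.append((pos * size, (pos + 1) * size))
        if pos % base > 0:
            pos -= 1              # gap filling: same size, one step earlier
        else:
            size, pos = size * base, pos // base - 1
    return out

for now in range(21):             # base 4, time 0 to 20
    print(now, targets_at(now, 4))
# e.g. at time 5:  [(5, 6), (4, 5), (0, 4)]
# and at time 20:  [(20, 21), (16, 20), (0, 16)]
```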

Hope that enlightens more than it blinds! It would be great to hear some feedback on this.
More might be said about the effects of using DTCS, what it's good for, etc. But that's another
story.
*  A possible extension to DTCS is to add an option (default = 0L) for a time that should
be considered time 0, rather than the Unix epoch.
** Slight reservation, max_compaction_threshold should be at least min_compaction_threshold^2,
to be absolutely sure.

> Compaction improvements to optimize time series data
> ----------------------------------------------------
>                 Key: CASSANDRA-6602
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Tupshin Harper
>            Assignee: Björn Hegerfors
>              Labels: compaction, performance
>             Fix For: 3.0
>         Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt,,
cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt,
> There are some unique characteristics of many/most time series use cases that both provide
challenges, as well as provide unique opportunities for optimizations.
> One of the major challenges is in compaction. The existing compaction strategies will
tend to re-compact data on disk at least a few times over the lifespan of each data point,
greatly increasing the cpu and IO costs of that write.
> Compaction exists to
> 1) ensure that there aren't too many files on disk
> 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously
> 3) delete data due to TTLs or tombstones
> The special characteristics of time series data allow us to optimize away all three.
> Time series data
> 1) tends to be delivered in time order, with relatively constrained exceptions
> 2) often has a pre-determined and fixed expiration date
> 3) Never gets deleted prior to TTL
> 4) Has relatively predictable ingestion rates
> Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need
for it. In that ticket, jbellis reasonably asks how that compaction strategy is better than
disabling compaction.
> Taking that to heart, here is a compaction-strategy-less approach that could be extremely
efficient for time-series use cases that follow the above pattern.
> (For context, I'm thinking of an example use case involving lots of streams of time-series
data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an
eventual steady state of 5TB per node)
> 1) You have an extremely large memtable (preferably off heap, if/when doable) for the
table, and that memtable is sized to be able to hold a lengthy window of time. A typical period
might be one day. At the end of that period, you flush the contents of the memtable to an
sstable and move to the next one. This is basically identical to current behaviour, but with
thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question
is whether predictable intervals is actually necessary, or whether just waiting until the
huge memtable is nearly full is sufficient)
> 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped
once all of the columns have expired. (Another side note, it might be valuable to have a modified
version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required
that all columns have the same TTL)
> 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones.
> 4) Optionally add back an additional type of delete that would delete all data earlier
than a particular timestamp, resulting in immediate dropping of obsoleted sstables.
> The result is that for in-order delivered data, every cell will be laid out optimally
on disk on the first pass, and over the course of 1000 days and 5TB of data, there will "only"
be 1000 5GB sstables, so the number of filehandles will be reasonable.
> For exceptions (out-of-order delivery), most cases will be caught by the extended (24
hour+) memtable flush times and merged correctly automatically. For those that were slightly
askew at flush time, or were delivered so far out of order that they go in the wrong sstable,
there is relatively low overhead to reading from two sstables for a time slice, instead of
one, and that overhead would be incurred relatively rarely unless out-of-order delivery was
the common case, in which case, this strategy should not be used.
> Another possible optimization to address out-of-order would be to maintain more than
one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always
insert into whichever one of the two "owns" the appropriate range of time. By delaying flushing
the ahead one until we are ready to roll writes over to a third one, we are able to avoid
any fragmentation as long as all deliveries come in no more than 12 hours late (in this example,
presumably tunable).
> Anything that triggers compactions will have to be looked at, since there won't be any.
The one concern I have is the ramification of repair. Initially, at least, I think it would
be acceptable to just write one sstable per repair and not bother trying to merge it with
other sstables.

This message was sent by Atlassian JIRA
