cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peter Schuller (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-1658) support incremental sstable switching
Date Thu, 25 Nov 2010 19:27:14 GMT


Peter Schuller commented on CASSANDRA-1658:

At first I really really liked it, but then I realized a problem that takes a way a little
bit of it and now I'm not sure. Anyways, firstly what I like: Couple this with posix_fadvise()/DONTNEED
on the sstable's being switched *from*, and one would not even have to have memory for both
sets of sstables in order to remain hot in cases where you rely on a cf being mostly or completely
in memory.

The posix_fadvise() (and munlock() if mlock():ed sstables come into the picture in the future)
would presumably be done at some granularity higher than rows or calls would be much too frequent
for performance purposes. But doing so every few tens of MB:s or something should be fine.

In addition, on the topic of rate limiting, fsync():ing would still be required for rate limiting
purposes under some circumstances to avoid affecting read latencies too much (to avoid e.g.
the OS pushing out more than fits in battery-backed cache on a RAID controller as a result
of pushing data in bursts).

A big downside though is this: For workloads where performance is dependent on the warmness
of the cache with respect to the active set, this way of doing it would *still* imply most
of the negative effects of mass cache eviction. Any large cf with a significant warm-up period
would be highly effected.

A possible way to categorize a cf might be:

(1) Very small cf; fits in RAM with lots of margin.
(2) Smallish, just barely fits in RAM.
(3) Large; a lot larger than RAM.

On the premise that we're discussing situations where cache warmth is relevant the following
disposition of the above cf:s with respect to an incremental switch-over:

(1) Works, but doesn't matter much since it fits in RAM anyway (except for muliple such sstables,
but then see (2))
(2) Here we improve significantly by allowing us to lower the constant factor of RAM required
relative to domain data size.
(3) Doesn't work anyway due to eviction on writes.

So really, it seems to me that for situations where you need a reasonably high rate of compaction,
it would only work very well in (2) which is sort of a special case sitting in the middle
on a spectrum.

You do point out that slow compaction is a potential helper here, and I agree. Provided that
compaction is sufficiently slow that the warm-up period of the node is similar or less to
the time spent compacting, this would indeed work well even in the case of (3).

I would further suggest that if you are IOPS sensitive you probably have a strong desire to
limit compaction rate to something reasonable anyway.

It's not clear to me whether the trade-offs would tend to land on the side of it working well
in practice or not.

A reasonably realistic example of type (3) with concrete numbers (let me know if I'm taking
a mis-step in the calculations):

(I am about to engage on some pretty speculative stuff that terminates with insufficient math
skills on my part; you may want to just skip the remainder of this comment.)

Say you have a 500 GB CF, and 16 GB of page cache in the OS. Say you have a warm-up period
of 30 minutes on a completely cold start before you're comfortable taking the load. Assume
that you don't want more than a ~ 25% impact in terms of cold IOPS during a compaction, relative
to the level of warmness you reach after your 30 minute warm-up on node start.

Eviction will tend to be random relative to the frequency/recency of access. So an instant
eviction of some percentage of page cache should result in a proportional (by some factor)
percentage of IOPS.

Assume that your workload is such that 90% of reads are served from cache. This should mean
that the factor in question is 10. I.e., a 10% eviction should result in a 100% increase in

Now, if over time cache hit rates increased linearly this would mean that a 25% target IOPS
increase during compaction translates into a 2.5% maximum eviction rate over the 30 minute
time window. But here is where we become dependent on the distribution of reads and unfortunately
where my math skills fail me.

But at least in the worst possible (unrealistic) case, those 2.5% over 30 minutes translates,
with a 16 GB page cache, into 400 MB/30 minutes. Compacting 500 GB would thus take 26 days.
Of course this is utterly unrealistic but should be an upper bound. Anyone with more math
skills want to chime in on the expected behavior given a long-tail distribution (for example)
where 30 minutes translates into the 90% hit rate?

> support incremental sstable switching
> -------------------------------------
>                 Key: CASSANDRA-1658
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Priority: Minor
> I have been thinking about how to minimize the impact of compaction further beyond CASSANDRA-1470.
1470 deals with the impact of the compaction process itself in that it avoids going through
the buffer cache; however, once compaction is complete you are still switching to new sstables
which will imply cold reads.
> Instead of switching all at once, one could keep both the old and new sstables around
for a bit and incrementally switch over traffic to the new sstables.
> A given request would go to the new or old sstable depending on e.g. the hash of the
row key couple with the point in time relative to compaction completion and relative to the
intended target sstable switch-over.
> In terms of end-user configuration/mnemonics, one would specify, for a given column family,
something like "sstable transition period per gb of data" or similar. The "per gb of data"
would refer to the size of the newly written sstable after a compaction. So; for a major compaction
you would wait for a very significant period of time since the entire database just went cold.
For a minor compaction, you would only wait for a short period of time.
> The result should be a reasonable negative impact on e.g. disk space usage, but hopefully
a very significant impact in terms of making the sstable transition as smooth as possible
for the node.
> I like this because it feels pretty simple, is not relying on OS specific features or
otherwise rely on specific support from the OS other than a "well functioning cache mechanism",
and does not imply something hugely significant like writing our own page cache layer. The
performance w.r.t. CPU should be very small, but the improvement in terms of disk I/O should
be very significant for workloads where it matters.
> The feature would be optional and per-sstable (or possibly global for the node).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message