cassandra-commits mailing list archives

From "Peter Schuller (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-1882) rate limit all background I/O
Date Tue, 28 Dec 2010 21:57:46 GMT


Peter Schuller commented on CASSANDRA-1882:

(First, haven't done further work yet because I'm away traveling and not really doing development.)

Jake: Thanks. However, I'm pretty skeptical, as I/O niceness only gives a very coarse way
of specifying what you want. So even if it worked beautifully in some particular case, it
won't in others, and there is no good way to control it AFAIK.

For example, the very first test I did (writing at a fixed speed at fixed chunk size concurrently
with seek-bound small reads) failed miserably by completely starving the writes (and this
was *without* ionice) until I switched away from cfq to noop or deadline, because cfq refused
to actually submit I/O requests to the device so that the device could do its own scheduling
based on better information (more on that in a future comment). Support for ionice is specific
to cfq, btw.
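
When repeating this kind of test it helps to record which scheduler is actually active for the device. A minimal sketch, assuming the Linux sysfs file /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets (e.g. "noop deadline [cfq]"). The function names are mine, and some kernels may show a value with no brackets, in which case this raises an error:

```python
import re
from pathlib import Path


def parse_active_scheduler(line: str) -> str:
    """Return the scheduler name marked with [brackets] in a sysfs scheduler line."""
    match = re.search(r"\[([^\]]+)\]", line)
    if match is None:
        raise ValueError(f"no active scheduler marked in: {line!r}")
    return match.group(1)


def active_scheduler(device: str) -> str:
    """Read the active I/O scheduler for a block device such as 'sda'.

    Assumption: the kernel exposes /sys/block/<device>/queue/scheduler.
    """
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return parse_active_scheduler(path.read_text())
```

For example, active_scheduler("sda") would be called before each run and logged next to the results, so runs under cfq are not compared with runs under noop or deadline by accident.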

I don't want to talk too many specifics yet because I want to do some more testing and try
a bit harder to make cfq do what I want before I start making claims, but I think that in
general, rate limiting I/O in such a way that you get sufficient throughput while not having
too adverse an effect on foreground reads is going to take some runtime tuning depending on
both workload and hardware (e.g., a lone disk vs. a 6-disk RAID10 are entirely different matters).
I think that simply telling the kernel to de-prioritize the compaction workload might work
well in some very specific situations (exactly the right kernel version, io scheduler choice/parameters,
workloads and underlying storage device), but not in general. 

More to come. Hopefully with some Python code + sysbench command lines for easy testing by
others on differing hardware setups. (I have not yet tested with a real rate-limited Cassandra,
but did testing with sysbench for reads and a Python writer doing chunk-size I/O with fsync().
Tests were done on RAID5/RAID10 and with xfs and ext4 (not all permutations). While file system
choice has some impact, all results instantly became useless once I realized that I/O scheduling
was orders of magnitude more important.)
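
To make the writer side of the test concrete, here is a rough sketch of a fixed-rate writer doing chunk-size I/O with fsync(). This is not the actual test script; the function name and parameters are assumptions for illustration, and the clock/sleep arguments exist only so the pacing can be checked without real waiting:

```python
import os
import time


def write_at_fixed_rate(path, chunk_size, bytes_per_sec, total_bytes,
                        clock=time.monotonic, sleep=time.sleep):
    """Write total_bytes to path in chunk_size pieces, fsync()ing each chunk,
    paced so the long-run rate does not exceed bytes_per_sec."""
    chunk = b"\0" * chunk_size
    interval = chunk_size / bytes_per_sec  # time budget per chunk
    written = 0
    with open(path, "wb") as f:
        next_deadline = clock()
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            f.write(chunk[:n])
            f.flush()
            os.fsync(f.fileno())  # force the chunk to the device, not just page cache
            written += n
            # Pace against absolute deadlines so slow fsyncs do not accumulate drift.
            next_deadline += interval
            delay = next_deadline - clock()
            if delay > 0:
                sleep(delay)
    return written
```

Something like write_at_fixed_rate("/mnt/test/out.bin", 5 * 1024 * 1024, 20 * 1024 * 1024, 10 * 1024**3) would then run concurrently with the read benchmark (the path and numbers here are made up).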

> rate limit all background I/O
> -----------------------------
>                 Key: CASSANDRA-1882
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>            Priority: Minor
>             Fix For: 0.7.1
> There is a clear need to support rate limiting of all background I/O (e.g., compaction,
repair). In some cases background I/O is naturally rate limited as a result of being CPU bottlenecked,
but in all cases where the CPU is not the bottleneck, background streaming I/O is almost guaranteed
(barring a very smart RAID controller or I/O subsystem that happens to cater extremely
well to the use case) to be detrimental to the latency and throughput of regular live traffic.
> Ways in which live traffic is negatively affected by background I/O include:
> * Indirectly by page cache eviction (see e.g. CASSANDRA-1470).
> * Reads are directly detrimental when not otherwise limited for the usual reasons; large
continuing read requests that keep coming are battling with latency sensitive live traffic
(mostly seek bound). Mixing seek-bound, latency-critical I/O with bulk streaming is a classic no-no
for I/O scheduling.
> * Writes are directly detrimental in a similar fashion.
> * But in particular, writes are more difficult still: Caching effects tend to augment
the effects because lacking any kind of fsync() or direct I/O, the operating system and/or
RAID controller tends to defer writes when possible. This often leads to a very sudden throttling
of the application when caches are filled, at which point there is potentially a huge backlog
of data to write.
> ** This may evict a lot of data from page cache since dirty buffers cannot be evicted
prior to being flushed out (though CASSANDRA-1470 and related will hopefully help here).
> ** In particular, one major reason why battery-backed RAID controllers are great is that
they have the capability to "eat" storms of writes very quickly and schedule them pretty efficiently
with respect to a concurrent continuous stream of reads. But this ability is defeated if we
just throw data at it until entirely full. Instead a rate-limited approach means that data
can be thrown at said RAID controller at a reasonable pace and it can be allowed to do its
job of limiting the impact of those writes on reads.
> I propose a mechanism whereby all such background reads are rate limited in terms of
MB/sec throughput. There would be:
> * A configuration option to state the target rate (probably a global, until there is
support for per-cf sstable placement)
> * A configuration option to state the sampling granularity. The granularity would have
to be small enough for rate limiting to be effective (i.e., the amount of I/O generated in
between each sample must be reasonably small) while large enough to not be expensive (neither
in terms of gettimeofday() type overhead, nor in terms of causing smaller writes so that
would-be streaming operations become seek bound). There would likely be a recommended value
on the order of say 5 MB, with a recommendation to multiply that with the number of disks
in the underlying device (5 MB assumes classic mechanical disks).
> Because of the coarse granularity (= infrequent synchronization), there should not be a significant
overhead associated with maintaining a shared global rate limiter for the Cassandra instance.
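
The proposal above could be sketched roughly as follows. This is only an illustration in Python, not Cassandra code: BackgroundRateLimiter, recommended_granularity and their parameters are hypothetical names, the target rate is given in bytes/sec rather than MB/sec for simplicity, and holding the lock while sleeping is a deliberate simplification so that one global limit applies to all background threads:

```python
import threading
import time

MB = 1024 * 1024


def recommended_granularity(num_disks, per_disk_bytes=5 * MB):
    """Sampling granularity: ~5 MB per classic mechanical disk, times the disk count."""
    return per_disk_bytes * num_disks


class BackgroundRateLimiter:
    """Shared limiter: the clock is only consulted once per granularity_bytes of I/O."""

    def __init__(self, target_bytes_per_sec, granularity_bytes,
                 clock=time.monotonic, sleep=time.sleep):
        if target_bytes_per_sec <= 0 or granularity_bytes <= 0:
            raise ValueError("rate and granularity must be positive")
        self._rate = target_bytes_per_sec
        self._granularity = granularity_bytes
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._pending = 0
        self._window_start = clock()

    def acquire(self, nbytes):
        """Account for nbytes of background I/O; return the seconds slept."""
        with self._lock:
            self._pending += nbytes
            if self._pending < self._granularity:
                return 0.0  # below the sampling granularity: no clock call, no sleep
            expected = self._pending / self._rate  # time this much I/O should take
            elapsed = self._clock() - self._window_start
            delay = max(0.0, expected - elapsed)
            if delay > 0:
                self._sleep(delay)  # other background threads wait on the lock too
            self._pending = 0
            self._window_start = self._clock()
            return delay
```

A compaction or repair loop would call limiter.acquire(len(buf)) after each read or write of buf. With a granularity of recommended_granularity(6) on a 6-disk array, the limiter only touches the clock about once per 30 MB of I/O.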

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
