incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Schuller <>
Subject Re: Read Latency Degradation
Date Sat, 18 Dec 2010 10:24:47 GMT
> How many nodes? 10 - 16 cores each (2 x quad ht cpus)
> How much ram per node? 24gb
> What disks and how many? SATA 7200rpm 1x1tb for commit log, 4x1tb (raid0)
> for data
> Is your ring balanced? yes, random partitioned very evenly
> How many column families? 4 CFs x 3 Keyspaces
> How much ram is dedicated to cassandra? 12gb heap (probably too high?)
> What type of caching are you using? Key caching
> What are the sizes of caches? 500k-1m values for 2 of the CFs
> What is the hit rate of caches? high, 90%+
> What does your disk utiliztion|CPU|Memory look like at peak times? Disk goes
> to 90%+ under heavy read load. CPU load high as well. Latency does not
> change that much for single reads vs. under load (30 threads). We can keep
> current read latency up to 25-30 read threads if no writes or compaction is
> going on. We are worried about what we see in terms of latency for a single
> read.
> What are your average mean|max row size from cfstats? 30k avg/5meg max for
> one CF and 311k avg/855k max for the other.
> On average for a given sstable how large is the data bloom and index files?
> 30gig data, 189k filter, 5.7meg index for one CF, 98gig data, 587k filter,
> 18meg index for the other.

I do want to make one point very clear: Regardless of whether or not
you're running a perfectly configured Cassandra, or any other data
base, whenever your total data set is beyond RAM you *will* take I/O
load as a function of the locality of data access. There is just no
way around that. If 10% of your read requests go to data that has not
recently been read (the long tail), it won't be cached, and no storage
system ever will magically make that read not go down to disk.

There are definitely things to tweak. For example, making sure that
the memory you do have for caching purposes is used as efficiently as
possible to minimize the amount of reads that go down to disk. But in
order to figure out what is going on you are going to have to gain an
understanding of what the data access pattern is like. For example, if
the hot set is very small and slowly changing, you may be able to have
100 TB per node and take the traffic without any difficulties. On the
other hand if your data is accessed completely randomly, you may have
trouble with a data set just twice the size of RAM (and most requests
and up going down to disk).

In the worst case access patterns where caching is completely
ineffective (such as random access to a data set several times the
size of RAM), it will be solely about disk IOPS and SSD:s are probably
the way to go unless lots and lots of servers become cheaper
(depending on data sizes mostly).

Now, that said, the practical reality with Cassandra is not the
theoretical optimum w.r.t. caching. Some things to consider:

(1) Whatever memory you give the JVM is going to be a direct trade-off
in the sense that said memory is no longer available for the operating
system page cache. If you "waste" memory there for no good reason,
you're just loosing caching potential.

(2) That said, the row cache in Cassandra is not affected by
compaction/repair. Anything cached by the operating system page cache
(anything not in row cache) will be affected, in terms of cache
warmness, by compactions and repair operations. This means you have to
expect a variation in cache locality whenever compaction/repair runs,
in addition to the I/O load caused directly by same. (There is
on-going work to improve this, but nothing usable right now.)

(3) Reads that do go down to disk will go down to all sstables that
contain data for that *row* (the bloom filters are at the row key
level). If you have reads to rows that span multiple sstables (because
they are continually written for example) this means that sstable
counts is very important. If you only write rows once and never touch
them again, it is much much less of an issue.

(4) There is a limit to bloom filter size that means that if you go
above 143 M keys in a single sstable (I believe that is the number
based on IRC conversations, I haven't checked) the bloom filter false
positive rate will increase. Very significantly if you are much much
bigger than 143 M keys. This leads to unnecessary reads from sstables
that do not contain data for the row being requested. Given that you
say you have lots of data, consider this. But note that only the row
*count* matters, not the size (in terms of megabytes) of the sstables.
Remember that on major compactions, all data in a column family is
going to end up in a single sstable, so do consider the total row
count (per node) here, not just what the sstable layout happens to be
be now. (There is on-going work to remove this limitation w.r.t. row

(5) In general the way I/O works, latency will skyrocket once you
start saturating your disks. As long as you're significantly below
full utilization of your disks, you'll see pretty stable and low
latencies. As you approach full saturation, the latencies will tend to
increase super-linearly. Once you're *above* saturation, your
latencies skyrocket and reads are dropped because the rate cannot be
sustained. This means that while latency is a great indicator to look
at to judge what the current user perceived behavior is, it is *not* a
good thing to look at to extrapolate resource demands or figure out
how far you are from saturation / need for more hardware.

/ Peter Schuller

View raw message