cassandra-user mailing list archives

From "C. Scott Andreas" <csco...@urbanairship.com>
Subject Re: 99.999% uptime - Operations Best Practices?
Date Wed, 22 Jun 2011 23:58:58 GMT
Hi Les,

I wanted to offer a couple thoughts on where to start and strategies for approaching development
and deployment with reliability in mind.

One way we've found to think more productively about the reliability of our data tier
is to shift our focus away from "uptime or x nines" and toward "error rates." Ryan
mentioned that "it depends," and while brief, that's a very accurate answer. Perhaps I can
help elaborate.
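
(For a rough sense of scale, here's a back-of-the-envelope sketch in Python of what each "nine"
allows, expressed both as downtime and as an error budget; the 1,000 requests/second figure is
purely an assumption for illustration.)

    # Back-of-the-envelope: downtime and error budgets per "nine".
    # The request rate below is an assumed figure, not a measurement.
    SECONDS_PER_YEAR = 365.25 * 24 * 3600
    REQUESTS_PER_SECOND = 1000  # assumed steady load

    for nines in (3, 4, 5):
        unavailability = 10.0 ** -nines
        downtime_minutes = unavailability * SECONDS_PER_YEAR / 60
        error_budget = unavailability * REQUESTS_PER_SECOND * SECONDS_PER_YEAR
        print("%d nines: ~%.1f min/year of downtime, ~%d failed requests/year"
              % (nines, downtime_minutes, error_budget))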

Failures in systems distributed across many nodes in multiple datacenters can rarely
be described in terms of binary uptime guarantees (i.e., either everything is up or everything
is down). Instead, certain nodes may be unavailable at certain times, but given appropriate
read and write parameters (and their implicit tradeoffs), these service interruptions may
remain transparent to clients.

Cassandra provides a variety of tools to let you tune these, two of the most important
of which are the consistency level for reads and writes and your replication factor. I'm sure
you're familiar with these, but I mention them because thinking hard about the tradeoffs you're
willing to make in terms of consistency and replication can heavily affect your operational
experience if availability is of utmost importance.
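
To make that concrete, here's a minimal sketch of both knobs. I'm using a current-generation
Python client (the DataStax driver) purely for brevity -- the Thrift-era clients such as Hector
and pycassa expose the same choices -- and the keyspace, table, and addresses below are made up
for illustration:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['10.0.0.1', '10.0.0.2', '10.0.0.3'])
    session = cluster.connect()

    # Replication factor is a property of the keyspace: three copies in 'dc1' here.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS events
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS events.log (key text PRIMARY KEY, value text)
    """)

    # Consistency level is chosen per request: QUORUM trades some availability
    # for stronger consistency, ONE does the reverse.
    write = SimpleStatement(
        "INSERT INTO events.log (key, value) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(write, ("some-key", "some-value"))

    read = SimpleStatement(
        "SELECT value FROM events.log WHERE key = %s",
        consistency_level=ConsistencyLevel.ONE)
    row = session.execute(read, ("some-key",)).one()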

Of course, the single-node operational story is very important as well. Ryan's "it depends"
comment takes on painful significance for me here, as we've found that the way read and write
loads vary, along with their duration and intensity, can produce very different operational
profiles and failure modes. If relaxed consistency is acceptable for your reads and writes,
you'll likely find querying with CL.ONE to be more "available" than QUORUM or ALL, at the
cost of reduced consistency. Similarly, if it is economical for you to provision extra nodes
for a higher replication factor, you will increase your ability to continue reading and writing
in the event of single- or multiple-node failures.
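
(The arithmetic behind that tradeoff is simple enough to sketch; nothing below is
Cassandra-specific beyond the usual quorum definition, and strong consistency requires that
read replicas plus write replicas exceed the replication factor.)

    # How many replicas can be down while requests still succeed,
    # for a given replication factor (RF) and consistency level (CL).
    def quorum(rf):
        return rf // 2 + 1

    def tolerable_failures(rf, cl):
        required = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}[cl]
        return rf - required

    for cl in ("ONE", "QUORUM", "ALL"):
        print("RF=3, CL.%s: survives %d replica(s) down" % (cl, tolerable_failures(3, cl)))
    # RF=3: ONE survives 2 replicas down, QUORUM survives 1, ALL survives none.
    # QUORUM reads + QUORUM writes (2 + 2 > 3) give strongly consistent results.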

One of the prime challenges we've faced is reducing the frequency and intensity of full garbage
collections in the JVM, which tend to result in single-node unavailability. Thanks to help
from Jonathan Ellis and Peter Schuller (along with a fair amount of elbow grease ourselves),
we've worked through several of these issues and have arrived at a steady state that leaves
the ring happy even under load. We've not found GC tuning to bring night-and-day differences
outside of resolving the stop-the-world collections, but the difference is noticeable.
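
For concreteness, the options involved are the ones cassandra-env.sh already exposes. The heap
numbers below are placeholders for illustration rather than the exact values on our ring; the
right sizes depend entirely on your hardware and workload:

    # cassandra-env.sh style: options appended to JVM_OPTS. Heap sizes are illustrative.
    JVM_OPTS="$JVM_OPTS -Xms8G -Xmx8G -Xmn800M"                # fixed heap, explicit young gen
    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75" # start CMS before the heap fills
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/cassandra/gc.log"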

Occasionally, these issues will result from Cassandra's own behavior; even documented operations,
such as asking for the count of all columns associated with a key, will materialize the row on
every node queried. Once, when issuing a "count" query at CL.QUORUM for a key that had around
300k columns, we knocked three nodes out of our ring by triggering a stop-the-world collection
that lasted about 30 seconds, so watch out for things like that.
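
One defense against that particular trap is to avoid asking a replica to materialize an entire
wide row at once and instead walk it in bounded slices. In the Thrift-era clients that means
column slices with a moving start column; a sketch of the same idea with a current-generation
Python client and a made-up table looks like this:

    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['10.0.0.1']).connect('events')

    # Instead of one giant count over a ~300k-column row, stream it in pages so
    # no single replica has to hold the entire row in memory at once.
    stmt = SimpleStatement("SELECT col FROM wide_rows WHERE key = %s",
                           fetch_size=1000)
    total = sum(1 for _ in session.execute(stmt, ("hot-key",)))
    print("columns:", total)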

Some of the other tuning knobs available to you involve tradeoffs such as when to flush memtables
and when to trigger compactions, both of which are somewhat intensive operations that can strain
a cluster under heavy read or write load, but which are equally necessary for the cluster
to remain in operation. If you find yourself pushing hard against these tradeoffs and attempting
to navigate a path between icebergs, it's very likely that the best answer to the problem
is "more, or more powerful, hardware."

But a lot of this is tacit knowledge, which often comes through a bit of pain but is hopefully
operationally transparent to your users -- the kind of thing you discover once the system is
live and your monitoring is providing continuous feedback about the ring's health. This is
where Sasha's point becomes so critical: having early-warning systems in place, watching
monitoring and graphs closely even when everything's fine, and beginning to understand how the
ring likes to operate and what it tends to do will give you a huge leg up on reliability and
let you react to issues before they have operational impact.
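
As a starting point on the early-warning side, the JMX metrics Cassandra already exposes (and
the nodetool views over them) cover a lot of ground; graphing a handful of these over time is
usually enough to see trouble building before it bites. The host address here is illustrative:

    nodetool -h 10.0.0.1 ring               # token ownership, and whether any node is Down
    nodetool -h 10.0.0.1 tpstats            # pending/blocked thread-pool tasks (backpressure)
    nodetool -h 10.0.0.1 cfstats            # per-column-family latencies and sstable counts
    nodetool -h 10.0.0.1 compactionstats    # pending compactions piling up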

You mention that you've been building HA systems for a long time -- indeed, far longer than
I have, so I'm sure you're also aware that clean, binary "up/down" signals are hard to
come by, that none of this is easy, and that while some pointers are available (the defaults
are actually quite good), it's essentially impossible to offer "the best production settings,"
because they vary wildly based on your hardware, ring configuration, read/write load, and
query patterns.

To that end, you might find it more productive to begin with the defaults as you develop your
system, and let the ring tell you how it's feeling as you begin load testing. Once you have
stressed it to the point of failure, you'll see how it failed and either be able to isolate
the cause and begin planning to handle that mode, or better yet, understand your maximum capacity
limits given your current hardware and fire off a purchase order the second you see spikes
nearing 80% of the total measured capacity in production (or apply lessons you've learned
in capacity planning as appropriate, of course).
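
(A trivial sketch of that last point; the ceiling has to come from your own load tests, and the
numbers below are placeholders.)

    # Alert when production load approaches a fraction of the capacity measured
    # by load-testing the ring to failure. All numbers are illustrative.
    MEASURED_CAPACITY_OPS = 40000   # ops/sec at which the test ring fell over
    ALERT_THRESHOLD = 0.80          # start planning for hardware at 80%

    def needs_more_hardware(current_ops_per_sec):
        return current_ops_per_sec >= ALERT_THRESHOLD * MEASURED_CAPACITY_OPS

    print(needs_more_hardware(29000))  # False
    print(needs_more_hardware(33000))  # True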

Cassandra's a great system, but you may find that it requires a fair amount of active operational
involvement and monitoring -- like any distributed system -- to keep running in a highly reliable
fashion. Each of those nines implies extra time and operational cost, hopefully within the
boundaries of the revenue stream the system is expected to support.

Pardon the long e-mail and the philosophical waxing. I hope this provides some food
for thought.

- Scott

---

C. Scott Andreas
Engineer, Urban Airship, Inc.
http://www.urbanairship.com

On Jun 22, 2011, at 4:16 PM, Les Hazlewood wrote:

> On Wed, Jun 22, 2011 at 4:11 PM, Peter Lin <woolfel@gmail.com> wrote:
> you have to use multiple data centers to really deliver 4 or 5 9's of service
> 
> We do, hence my question, as well as my choice of Cassandra :)
> 
> Best,
> 
> Les


