cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Les Hazlewood <>
Subject Re: 99.999% uptime - Operations Best Practices?
Date Thu, 23 Jun 2011 00:17:21 GMT
Hi Scott,

First, let me say that this email was amazing - I'm always appreciative of
the time that anyone puts into mailing list replies, especially ones as
thorough, well-thought and articulated as this one.  I'm a firm believer
that these types of replies reflect a strong and durable open-source
community.  You, sir, are a bad ass :)  Thanks so much!

As for the '5 9s' comment, I apologize for even writing that - it threw
everyone off.  It was a shorthand way of saying "this data store is so
critical to the product, that if it ever goes down entirely (as it did for
one user of 4 nodes, all at the same time), then we're screwed."  I was
hoping to trigger the 'hrm - what have we done ourselves to work to that
availability that wasn't easily represented in the documentation' train of
thought.  It proved to be a red herring however, so I apologize for even
bringing it up.

Thanks *very* much for the reply.  I'll be sure to follow up with the list
as I come across any particular issues and I'll also report my own findings
in the interest of (hopefully) being beneficial to anyone in the future.



On Wed, Jun 22, 2011 at 4:58 PM, C. Scott Andreas

> Hi Les,
> I wanted to offer a couple thoughts on where to start and strategies for
> approaching development and deployment with reliability in mind.
> One way that we've found to more productively think about the reliability
> of our data tier is to focus our thoughts away from a concept of "uptime or
> *x* nines" toward one of "error rates." Ryan mentioned that "it depends,"
> and while brief, this is actually a very correct comment. Perhaps I can help
> elaborate.
> Failures in systems distributed across multiple systems in multiple
> datacenters can rarely be described in terms of binary uptime guarantees
> (e.g., either everything is up or everything is down). Instead, certain
> nodes may be unavailable at certain times, but given appropriate read and
> write parameters (and their implicit tradeoffs), these service interruptions
> may remain transparent.
> Cassandra provides a variety of tools to allow you to tune these, two of
> the most important of which are the consistency level for reads and writes
> and your replication factor. I'm sure you're  familiar with these, but
> mention them because thinking hard about the tradeoffs you're willing to
> make in terms of consistency and replication may heavily impact your
> operational experience if availability is of utmost importance.
> Of course, the single-node operational story is very important as well.
> Ryan's "it depends" comment here takes on painful significance for myself,
> as we've found that the manner in which read and write loads vary, their
> duration, and intensity can have very different operational profiles and
> failure modes. If relaxed consistency is acceptable for your reads and
> writes, you'll likely find querying with CL.ONE to be more "available" than
> QUROUM or ALL, at the cost of reduced consistency. Similarly, if it is
> economical for you to provision extra nodes for a higher replication factor,
> you will increase your ability to continue reading and writing in the event
> of single- or multiple-node failures.
> One of the prime challenges we've faced is reducing the frequency and
> intensity of full garbage collections in the JVM, which tend to result in
> single-node unavailability. Thanks to help from Jonathan Ellis and Peter
> Schuller (along with a fair amount of elbow grease ourselves), we've worked
> through several of these issues and have arrived at a steady state that
> leaves the ring happy even under load. We've not found GC tuning to bring
> night-and-day differences outside of resolving the STW collections, but the
> difference is noticeable.
> Occasionally, these issues will result from Cassandra's behavior itself;
> documented APIs such as querying for the count of all columns associated
> with a key will materialize the row across all nodes being queried. Once
> when issuing a "count" query for a key that had around 300k columns at
> CL.QUORUM, we knocked three nodes out of our ring by triggering a
> stop-the-world collection that lasted about 30 seconds, so watch out for
> things like that.
> Some of the other tuning knobs available to you involve tradeoffs such as
> when to flush memtables or to trigger compactions, both of which are
> somewhat intensive operations that can strain a cluster under heavy read or
> write load, but which are equally necessary for the cluster to remain in
> operation. If you find yourself pushing hard against these tradeoffs and
> attempting to navigate a path between icebergs, it's very likely that the
> best answer to the problem is "more or more powerful hardware."
> But a lot of this is tacit knowledge, which often comes through a bit of
> pain but is hopefully operationally transparent to your users.  Things that
> you discover once the system is live in operation and your monitoring is
> providing continuous feedback about the ring's health. This is where Sasha's
> point becomes so critical -- having advanced early-warning systems in place,
> watching monitoring and graphs closely even when everything's fine, and
> beginning to understand how it *likes* to operate and what it tends to do
> will give you a huge leg up on your reliability and allow you to react to
> issues in the ring before they present operational impact.
> You mention that you've been building HA systems for a long time -- indeed,
> far longer than I have, so I'm sure that you're also aware that good, solid
> "up/down" binaries are hard to come by, that none of this is easy, and that
> while some pointers are available (the defaults are actually quite good),
> it's essentially impossible to offer "the best production defaults" because
> they vary wildly based on your hardware, ring configuration, and read/write
> load and query patterns.
> To that end, you might find it more productive to begin with the defaults
> as you develop your system, and let the ring tell you how it's feeling as
> you begin load testing. Once you have stressed it to the point of failure,
> you'll see how it failed and either be able to isolate the cause and begin
> planning to handle that mode, or better yet, understand your maximum
> capacity limits given your current hardware and fire off a purchase order
> the second you see spikes nearing 80% of the total measured capacity in
> production (or apply lessons you've learned in capacity planning as
> appropriate, of course).
> Cassandra's a great system, but you may find that it requires a fair amount
> of active operational involvement and monitoring -- like any distributed
> system -- to maintain in a highly-reliable fashion. Each of those nines
> implies extra time and operational cost, hopefully within the boundaries of
> the revenue stream the system is expected to support.
> Pardon the long e-mail and for waxing a bit philosophical. I hope this
> provides some food for thought.
> - Scott
> ---
> C. Scott Andreas
> Engineer, Urban Airship, Inc.

View raw message