cassandra-user mailing list archives

From William Oberman <>
Subject Re: 99.999% uptime - Operations Best Practices?
Date Thu, 23 Jun 2011 00:36:33 GMT
Attached are the day and week views of one of the cassandra boxes.  I think
it's obvious where the OOM happened :-)

To Thoku's points:
-I'm running with the Sun JVM, though I had to explicitly set the RMI hostname
(== the IP address) to allow JMX to work (a sketch of the flags is below this
list)
-I didn't install JNA; should I be worried? :-)
-Amazon Large instances don't come with swap, and I didn't explicitly enable it
-I don't run nodetool repair regularly, but my system is append-only, and the
docs seem to indicate repair is for cleaning up deletes
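
For reference, the JMX/RMI workaround is just a couple of extra JVM properties
in conf/cassandra-env.sh.  A rough sketch of what it looks like (the IP and
port here are placeholders, and Cassandra already ships most of the JMX lines):

  JMX_PORT="7199"

  # standard Sun JVM remote-JMX properties (the default file has similar lines)
  JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.port=$JMX_PORT"
  JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.ssl=false"
  JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=false"

  # the actual workaround: tell RMI which address to hand back to JMX clients,
  # otherwise JConsole tries whatever hostname the node resolves itself as
  JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=10.0.1.5"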

Other things:
-Connecting JConsole through a firewall is basically impossible.  I finally
had to install a VPN (I used Hamachi).
-I use nagios for alerts
-I'm writing my backup procedure now, but my approach is (for AWS):
  nodetool snapshot
  if the EBS volume is too small for the snapshot:
    if no EBS volume exists, create one at 2x the snapshot size
    otherwise, resize the volume to 2x the snapshot size
  rsync the snapshot to the EBS volume
  take an AWS snapshot of the EBS volume
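
In script form the procedure looks roughly like this (the paths, mount point,
and volume ID are placeholders, and the grow-the-volume step is left out for
brevity):

  #!/bin/sh
  # sketch of the snapshot -> rsync -> EBS-snapshot flow above
  DATA_DIR=/var/lib/cassandra/data    # data_file_directories from cassandra.yaml
  EBS_MOUNT=/mnt/backup               # where the EBS volume is mounted
  EBS_VOLUME=vol-xxxxxxxx             # the volume's EC2 id
  TAG=$(date +%Y%m%d%H%M)

  # 1. take a Cassandra snapshot (hard links under each keyspace's snapshots/ dir)
  nodetool -h localhost snapshot

  # 2. copy the snapshot files onto the EBS volume, one directory per keyspace
  for ks in "$DATA_DIR"/*; do
    rsync -a "$ks/snapshots/" "$EBS_MOUNT/$TAG/$(basename "$ks")/"
  done

  # 3. snapshot the EBS volume itself (using the ec2-api-tools here)
  ec2-create-snapshot "$EBS_VOLUME"

  # 4. clear the on-disk snapshots once they've been copied off
  nodetool -h localhost clearsnapshot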

I'm sure I'll think of more stuff later.


On Wed, Jun 22, 2011 at 8:17 PM, Les Hazlewood <> wrote:

> Hi Scott,
> First, let me say that this email was amazing - I'm always appreciative of
> the time that anyone puts into mailing list replies, especially ones as
> thorough, well-thought and articulated as this one.  I'm a firm believer
> that these types of replies reflect a strong and durable open-source
> community.  You, sir, are a bad ass :)  Thanks so much!
> As for the '5 9s' comment, I apologize for even writing that - it threw
> everyone off.  It was a shorthand way of saying "this data store is so
> critical to the product, that if it ever goes down entirely (as it did for
> one user of 4 nodes, all at the same time), then we're screwed."  I was
> hoping to trigger the 'hrm - what have we done ourselves to work to that
> availability that wasn't easily represented in the documentation' train of
> thought.  It proved to be a red herring however, so I apologize for even
> bringing it up.
> Thanks *very* much for the reply.  I'll be sure to follow up with the list
> as I come across any particular issues and I'll also report my own findings
> in the interest of (hopefully) being beneficial to anyone in the future.
> Cheers,
> Les
> On Wed, Jun 22, 2011 at 4:58 PM, C. Scott Andreas <> wrote:
>> Hi Les,
>> I wanted to offer a couple thoughts on where to start and strategies for
>> approaching development and deployment with reliability in mind.
>> One way that we've found to more productively think about the reliability
>> of our data tier is to focus our thoughts away from a concept of "uptime or
>> *x* nines" toward one of "error rates." Ryan mentioned that "it depends,"
>> and while brief, this is actually a very correct comment. Perhaps I can help
>> elaborate.
>> Failures in systems distributed across multiple machines in multiple
>> datacenters can rarely be described in terms of binary uptime guarantees
>> (e.g., either everything is up or everything is down). Instead, certain
>> nodes may be unavailable at certain times, but given appropriate read and
>> write parameters (and their implicit tradeoffs), these service interruptions
>> may remain transparent.
>> Cassandra provides a variety of tools to allow you to tune these, two of
>> the most important of which are the consistency level for reads and writes
>> and your replication factor. I'm sure you're familiar with these, but I
>> mention them because thinking hard about the tradeoffs you're willing to
>> make in terms of consistency and replication may heavily impact your
>> operational experience if availability is of utmost importance.
>> Of course, the single-node operational story is very important as well.
>> Ryan's "it depends" comment here takes on painful significance for myself,
>> as we've found that the manner in which read and write loads vary, their
>> duration, and intensity can have very different operational profiles and
>> failure modes. If relaxed consistency is acceptable for your reads and
>> writes, you'll likely find querying with CL.ONE to be more "available" than
>> QUORUM or ALL, at the cost of reduced consistency. Similarly, if it is
>> economical for you to provision extra nodes for a higher replication factor,
>> you will increase your ability to continue reading and writing in the event
>> of single- or multiple-node failures.
>> One of the prime challenges we've faced is reducing the frequency and
>> intensity of full garbage collections in the JVM, which tend to result in
>> single-node unavailability. Thanks to help from Jonathan Ellis and Peter
>> Schuller (along with a fair amount of elbow grease ourselves), we've worked
>> through several of these issues and have arrived at a steady state that
>> leaves the ring happy even under load. We've not found GC tuning to bring
>> night-and-day differences outside of resolving the STW collections, but the
>> difference is noticeable.
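>> For concreteness, the knobs involved live mostly in conf/cassandra-env.sh; a
>> rough sketch of the GC-logging and CMS flags (the values and log path here
>> are illustrative, and the shipped defaults are already close to this, so
>> treat it as a starting point rather than a recipe):
>>
>>   # make GC visible: log collections so you can see pause times and causes
>>   JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
>>   JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
>>   JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
>>
>>   # CMS collector settings (Cassandra ships variants of these by default)
>>   JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
>>   JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>
>>   # start CMS earlier so concurrent collections finish before the heap fills,
>>   # which is what avoids falling back to a stop-the-world full collection
>>   JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
>>   JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"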
>> Occasionally, these issues will result from Cassandra's behavior itself;
>> documented APIs such as querying for the count of all columns associated
>> with a key will materialize the row across all nodes being queried. Once
>> when issuing a "count" query for a key that had around 300k columns at
>> CL.QUORUM, we knocked three nodes out of our ring by triggering a
>> stop-the-world collection that lasted about 30 seconds, so watch out for
>> things like that.
>> Some of the other tuning knobs available to you involve tradeoffs such as
>> when to flush memtables or to trigger compactions, both of which are
>> somewhat intensive operations that can strain a cluster under heavy read or
>> write load, but which are equally necessary for the cluster to remain in
>> operation. If you find yourself pushing hard against these tradeoffs and
>> attempting to navigate a path between icebergs, it's very likely that the
>> best answer to the problem is "more or more powerful hardware."
>> But a lot of this is tacit knowledge, which often comes through a bit of
>> pain but is hopefully operationally transparent to your users: things that
>> you discover once the system is live in operation and your monitoring is
>> providing continuous feedback about the ring's health. This is where Sasha's
>> point becomes so critical -- having advanced early-warning systems in place,
>> watching monitoring and graphs closely even when everything's fine, and
>> beginning to understand how it *likes* to operate and what it tends to do
>> will give you a huge leg up on your reliability and allow you to react to
>> issues in the ring before they present operational impact.
>> You mention that you've been building HA systems for a long time --
>> indeed, far longer than I have, so I'm sure that you're also aware that
>> good, solid "up/down" binaries are hard to come by, that none of this is
>> easy, and that while some pointers are available (the defaults are actually
>> quite good), it's essentially impossible to offer "the best production
>> defaults" because they vary wildly based on your hardware, ring
>> configuration, and read/write load and query patterns.
>> To that end, you might find it more productive to begin with the defaults
>> as you develop your system, and let the ring tell you how it's feeling as
>> you begin load testing. Once you have stressed it to the point of failure,
>> you'll see how it failed and either be able to isolate the cause and begin
>> planning to handle that mode, or better yet, understand your maximum
>> capacity limits given your current hardware and fire off a purchase order
>> the second you see spikes nearing 80% of the total measured capacity in
>> production (or apply lessons you've learned in capacity planning as
>> appropriate, of course).
>> Cassandra's a great system, but you may find that it requires a fair
>> amount of active operational involvement and monitoring -- like any
>> distributed system -- to maintain in a highly-reliable fashion. Each of
>> those nines implies extra time and operational cost, hopefully within the
>> boundaries of the revenue stream the system is expected to support.
>> Pardon the long e-mail and for waxing a bit philosophical. I hope this
>> provides some food for thought.
>> - Scott
>> ---
>> C. Scott Andreas
>> Engineer, Urban Airship, Inc.

Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
