From: Dominic Williams
Date: Thu, 23 Jun 2011 13:59:43 +0100
Subject: Re: 99.999% uptime - Operations Best Practices?
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=000e0cd72ba0d46f0804a660a94f

--000e0cd72ba0d46f0804a660a94f
Content-Type: text/plain; charset=ISO-8859-1

Les,

Cassandra is a good system, but it has not reached version 1.0 yet, nor has HBase, etc. It is cutting-edge technology, and therefore in practice you are unlikely to achieve five nines immediately - even if in theory, with perfect planning, perfect administration and so on, this should be achievable even now.

The reasons you might choose Cassandra are:

1. A new, more flexible data model that may increase developer productivity and lead to faster release cycles
2. Superior capability when it comes to *writing* large volumes of data, which is incredibly useful in many applications
3. Horizontal scalability, where you can add nodes rather than buying bigger machines
4. Data redundancy, which means you have a kind of live backup going on, a bit like RAID - we use replication factor 3, for example
5. Due to the redundancy of data across the cluster, the ability to perform rolling restarts to administer and upgrade your nodes while the cluster continues to run (yes, this is the feature that in theory allows for continual operation, but in practice, until we reach 1.0, I don't think five nines of uptime is possible in every scenario yet because of deficiencies that may present themselves unexpectedly)
6. The benefit of building your new product on a platform designed to solve many modern computing challenges, which will give you a better upgrade path: for example, when you grow in future you won't have to change over from SQL to NoSQL because you're already on it!

These are pretty compelling arguments, but you have to be realistic about where Cassandra is right now.
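
To put rough numbers on two of the points above, here is a minimal sketch in Python. It computes how much downtime "five nines" actually allows per year, and how many replicas can be offline at replication factor 3 if the application reads and writes at a majority (quorum) level. The function names are purely illustrative, and the use of quorum operations is an assumption made for this example, not something stated in this thread.

```python
def downtime_budget_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Minutes of downtime allowed over `days` at the given availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * days * 24 * 60


def quorum_size(replication_factor: int) -> int:
    """Number of replicas that form a strict majority of the replication factor."""
    return replication_factor // 2 + 1


def replicas_that_can_be_down(replication_factor: int) -> int:
    """Replicas that may be offline while a majority can still respond.

    Assumes majority (quorum) reads and writes, which is an assumption
    of this sketch rather than a setting mentioned in the message.
    """
    return replication_factor - quorum_size(replication_factor)


if __name__ == "__main__":
    print(f"99.999% uptime allows {downtime_budget_minutes(99.999):.2f} minutes of downtime per year")
    print(f"RF=3: quorum is {quorum_size(3)}, so {replicas_that_can_be_down(3)} replica can be down")
```

Under these assumptions the yearly budget comes to a little over five minutes, which is why a rolling restart has to take only one replica of any given row offline at a time.
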
For what it's worth though, you might also consider how easy it is to screw up databases running on commercial production software that are handling very large amounts of data (just let the volumes handling the commit log run short of disk space, for example). Setting up a Cassandra cluster is the simplest way to handle big data I've seen, and this reduction in complexity will also contribute to uptime.

Best,
Dominic

On 22 June 2011 22:24, Les Hazlewood wrote:

> I'm planning on using Cassandra as a product's core data store, and it is
> imperative that it never goes down or loses data, even in the event of a
> data center failure. This uptime requirement ("five nines": 99.999% uptime)
> w/ WAN capabilities is largely what led me to choose Cassandra over other
> NoSQL products, given its history and 'from the ground up' design for such
> operational benefits.
>
> However, in a recent thread, a user indicated that all 4 of 4 of his
> Cassandra instances were down because the OS killed the Java processes due
> to memory starvation, and all 4 instances went down in a relatively short
> period of time of each other. Another user helped out and replied that
> running 0.8 and nodetool repair on each node regularly via a cron job (once
> a day?) seems to work for him.
>
> Naturally this was disconcerting to read, given our needs for a Highly
> Available product - we'd be royally screwed if this ever happened to us.
> But given Cassandra's history and its current production use, I'm aware
> that this HA/uptime is being achieved today, and I believe it is certainly
> achievable.
>
> So, is there a collective set of guidelines or best practices to ensure
> this problem (or unavailability due to OOM) can be easily managed?
>
> Things like memory settings, initial GC recommendations, cron
> recommendations, ulimit settings, etc. that can be bundled up as a
> best-practices "Production Kickstart"?
>
> Could anyone share their nuggets of wisdom or point me to resources where
> this may already exist?
>
> Thanks!
>
> Best regards,
>
> Les

--000e0cd72ba0d46f0804a660a94f--