From: Les Hazlewood
Date: Wed, 22 Jun 2011 14:24:13 -0700
Subject: 99.999% uptime - Operations Best Practices?
To: user@cassandra.apache.org

I'm planning on using Cassandra as a product's core data store, and it is imperative that it never goes down or loses data, even in the event of a data center failure. This uptime requirement ("five nines": 99.999% uptime) w/ WAN capabilities is largely what led me to choose Cassandra over other NoSQL products, given its history and 'from the ground up' design for such operational benefits.

However, in a recent thread, a user indicated that all 4 of his Cassandra instances were down because the OS had killed the Java processes due to memory starvation, and all 4 went down within a relatively short period of each other. Another user replied that running 0.8 and nodetool repair on each node regularly via a cron job (once a day?) seems to work for him; I've sketched at the end of this message what I imagine such a setup might look like.

Naturally this was disconcerting to read, given our needs for a Highly Available product - we'd be royally screwed if this ever happened to us. But given Cassandra's history and its current production use, I'm aware that this HA/uptime is being achieved today, and I believe it is certainly achievable.

So, is there a collective set of guidelines or best practices to ensure this problem (or unavailability due to OOM) can be easily managed?

Things like memory settings, initial GC recommendations, cron recommendations, ulimit settings, etc. that can be bundled up as a best-practices "Production Kickstart"?

Could anyone share their nuggets of wisdom or point me to resources where this may already exist?

Thanks!
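For concreteness, here's roughly the kind of setup I had in mind. The schedule, install paths, and limit values below are just my own guesses for illustration, not recommendations from anyone actually running this in production:

    # Hypothetical crontab entry: nightly anti-entropy repair, staggered
    # per node (this node runs at 02:00) so repairs don't overlap cluster-wide.
    # Assumes Cassandra lives under /opt/cassandra and logs to /var/log/cassandra.
    0 2 * * * /opt/cassandra/bin/nodetool -h localhost repair >> /var/log/cassandra/repair.log 2>&1

    # Hypothetical /etc/security/limits.conf entries: raise the file-handle
    # and memlock limits for the user that runs the Cassandra JVM.
    cassandra  soft  nofile   32768
    cassandra  hard  nofile   32768
    cassandra  soft  memlock  unlimited
    cassandra  hard  memlock  unlimited

Is something along these lines what people actually run, or is there a better pattern?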
Best regards,

Les