From: Andrew Psaltis
To: "user@cassandra.apache.org"
Subject: Cassandra Health Monitoring
Date: Tue, 22 Jun 2010 20:41:17 +0000

All,

We have been working through some operations scenarios so that we are ready to deploy our first Cassandra cluster into production in the coming months. During this process, our operations folks have asked us to provide a Health Check service. I am using the word "service" very liberally here; really, we just need to give the folks in our NOC a way to know not only that the Cassandra process is running (which they will get from their monitoring tools), but that it is actually alive and well. We do not intend to verify that the data is valid, just that every node in the cluster that is known to be running is actually alive and healthy.

My questions are: What does it mean for a Cassandra node to be healthy? What is the minimal set of checks, in terms of performance impact on a node, that we can run to make sure that a node is not a zombie?

Any and all input is greatly appreciated.

Thanks,
Andrew