From: Reverend Chip
Date: Thu, 11 Nov 2010 17:46:09 -0800
To: user@cassandra.apache.org
Subject: Cluster fragility

I've been running tests with first a four-node, then an eight-node cluster. I started with 0.7.0 beta3, but have since updated to a more recent Hudson build. I've been happy with a lot of things, but I've had some surprisingly unpleasant experiences with operational fragility.

For example, when adding four nodes to a four-node cluster (at 2x replication), I had two nodes that insisted they were streaming data, but no progress was made in the stream for over a day (this was with beta3). I had to reboot the cluster to clear that condition. For the purpose of making progress on other tests I decided just to reload the data at eight-wide (with the more recent build), but if I had data I couldn't reload or the cluster were serving in production, that would have been a very inconvenient failure. I also had a node that at first refused to bootstrap, but after I waited a day, it finally got its act together.
I write this not to complain per se, but to ask whether these failures are known and expected, and whether rebooting a cluster is just a Thing You Have To Do once in a while; or, if not, what techniques can be used to clear such cluster topology and streaming/replication problems without rebooting.