Message-ID: <4EE0204A.30904@bnl.gov>
Date: Wed, 07 Dec 2011 21:26:18 -0500
From: Maxim Potekhin
To: user@cassandra.apache.org
Subject: Cassandra behavior too fragile?

OK, thanks to the excellent help of the Datastax folks, some of the more severe inconsistencies in my Cassandra cluster were fixed (after a node was down, compactions failed, etc.). I'm still having problems, as reported in the "repairs 0.8.6." thread.

Thing is, why is it so easy for the repair process to break? OK, I admit I'm not sure why nodes are reported as "dead" once in a while, but I'm absolutely certain that they don't simply fall off the edge, get knocked out for 10 minutes, or anything like that. Why is there no built-in tolerance/retry mechanism, so that a node that seems silent for a minute can be contacted later or, better yet, a different node holding the relevant replica is contacted instead?

As was evident from some presentations at Cassandra-NYC yesterday, failed compactions and repairs are a major problem for a number of users. The cluster can quickly become unusable. I think it would be a good idea to build more robustness into these procedures.

Regards,
Maxim