Message-ID: <4EDC0783.90509@bnl.gov>
Date: Sun, 04 Dec 2011 18:51:31 -0500
From: Maxim Potekhin
To: user@cassandra.apache.org
Subject: Re: Repair failure under 0.8.6
References: <4EDAAF7E.40502@bnl.gov> <4EDACE10.6010804@bnl.gov> <4EDBB7A9.9060604@bnl.gov> <4EDBE255.4080905@bnl.gov>

As a side effect of the failed repair (so it seems), the disk usage on the
affected node now prevents compaction from working. Compaction still works
on the remaining nodes (we have 3 in total).

Is there a way to scrub the extraneous data?

Thanks,
Maxim

On 12/4/2011 4:29 PM, Peter Schuller wrote:
>> I will try to increase phi_convict -- I will just need to restart the
>> cluster after the edit, right?
> You will need to restart the nodes for which you want the phi convict
> threshold to be different. You might want to do this on, e.g., half of
> the cluster to do A/B testing.
>
>> I do recall that I see nodes temporarily marked as down, only to pop up
>> later.
> I recommend grepping through the logs on all the clusters (e.g., cat
> /var/log/cassandra/cassandra.log | grep UP | wc -l). That should tell
> you quickly whether they all seem to be seeing roughly as many node
> flaps, or whether some particular node or set of nodes is/are
> over-represented.
>
> Next, look at the actual nodes flapping (remove wc -l) and see if all
> nodes are flapping or if it is a single node, or a subset of the nodes
> (e.g., sharing a switch perhaps).
>
>> In the current situation, there is no load on the cluster at all, outside
>> the maintenance like the repair.
> Ok. So what I'm getting at then is that there may be real legitimate
> connectivity problems that you aren't noticing in any other way since
> you don't have active traffic to the cluster.
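
For anyone who wants the per-node breakdown Peter describes without piping grep by hand, here is a rough sketch in Python. It assumes that gossip state changes show up in the log as lines containing "InetAddress /<address> is now UP" (and "is now dead" or "is now DOWN" for the other direction); that wording is an assumption on my part and is not confirmed anywhere in this thread, so check it against your own logs first. The function names count_flaps and report are made up for this example.

```python
import re
from collections import Counter

# ASSUMPTION: state changes are logged as "InetAddress /10.0.0.2 is now UP"
# or "InetAddress /10.0.0.2 is now dead." Adjust the pattern to your logs.
STATE_CHANGE = re.compile(
    r"InetAddress\s+/?(\S+)\s+is now\s+(UP|dead|DOWN)", re.IGNORECASE
)


def count_flaps(lines):
    """Count 'is now UP' events per node address in an iterable of log lines."""
    ups = Counter()
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match and match.group(2).upper() == "UP":
            ups[match.group(1)] += 1
    return ups


def report(path="/var/log/cassandra/cassandra.log"):
    """Print nodes ordered by how often this log saw them come back up."""
    with open(path, errors="replace") as handle:
        for node, count in count_flaps(handle).most_common():
            print(f"{count:6d}  {node}")
```

Calling report() on each node and comparing the output should show whether every node sees roughly the same number of flaps, or whether one node or a subset of nodes is over-represented.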