Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of potekhin@bnl.gov designates
 130.199.3.132 as permitted sender)
Message-ID: <4EDC0783.90509@bnl.gov>
Date: Sun, 04 Dec 2011 18:51:31 -0500
From: Maxim Potekhin <potekhin@bnl.gov>
User-Agent: Mozilla/5.0 (Windows NT 6.0;
 rv:8.0) Gecko/20111105 Thunderbird/8.0
MIME-Version: 1.0
To: user@cassandra.apache.org
Subject: Re: Repair failure under 0.8.6
References: 
 <CACCYQcyG2rmJJzxQ_5YYYp8itxvX1pjey-dtXn1MS35dqjL2Lw@mail.gmail.com>
 <4EDAAF7E.40502@bnl.gov>
 <CAO5xsd3c8jk3BzKj2tsJXziKjO0PtJp53AiHj==+ZvC796JH1w@mail.gmail.com>
 <4EDACE10.6010804@bnl.gov> <4EDBB7A9.9060604@bnl.gov>
 <CAO5xsd2YVVKCZB9eHDZY9zZj6wjyhcbYCPKFCy3Yrnu43Yz5gg@mail.gmail.com>
 <4EDBE255.4080905@bnl.gov>
 <CAO5xsd0ZBzQ=XnE933O6sCC8SJwFBSCYX_zgsMSsqbNJJNyfQg@mail.gmail.com>
In-Reply-To: 
 <CAO5xsd0ZBzQ=XnE933O6sCC8SJwFBSCYX_zgsMSsqbNJJNyfQg@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

As an apparent side effect of the failed repair, disk usage on the
affected node is now high enough to prevent compaction from working.
Compaction still works on the remaining nodes (we have 3 in total).
Is there a way to scrub the extraneous data?

Thanks

Maxim


On 12/4/2011 4:29 PM, Peter Schuller wrote:

>> I will try to increase phi_convict -- I will just need to restart the
>> cluster after the edit, right?
> You will need to restart the nodes for which you want the phi convict
> threshold to be different. You might want to do this on, e.g., half of
> the cluster for A/B testing.
>
>> I do recall seeing nodes temporarily marked as down, only to pop up
>> later.
> I recommend grepping through the logs on all the nodes (e.g., cat
> /var/log/cassandra/cassandra.log | grep UP | wc -l). That should tell
> you quickly whether they all seem to be seeing roughly as many node
> flaps, or whether some particular node or set of nodes is
> over-represented.
>
> Next, look at the actual nodes flapping (remove wc -l) and see if all
> nodes are flapping or if it is a single node, or a subset of the nodes
> (e.g., sharing a switch perhaps).
>
>> In the current situation, there is no load on the cluster at all,
>> outside of maintenance like the repair.
> Ok. So what I'm getting at then is that there may be real legitimate
> connectivity problems that you aren't noticing in any other way since
> you don't have active traffic to the cluster.
>
>
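
The log check suggested in the quoted text above (count the UP
messages, then see which nodes they refer to) can be done in one pass
with a small script. This is only a rough sketch. It assumes the
gossip messages look like "InetAddress /10.0.0.2 is now UP", which may
not match your version or log configuration; adjust UP_PATTERN to the
lines you actually see. The function names are made up for
illustration.

```python
import re
from collections import Counter

# ASSUMPTION: gossip "node is back" lines look like
#   ... InetAddress /10.0.0.2 is now UP
# Adjust this pattern to match the lines in your own logs.
UP_PATTERN = re.compile(r"InetAddress\s+(\S+)\s+is now UP")


def count_up_events(lines):
    """Return a Counter mapping each peer address to its number of UP events."""
    counts = Counter()
    for line in lines:
        match = UP_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_up_events_in_file(path):
    """Count UP events per peer in one log file (hypothetical helper)."""
    with open(path, errors="replace") as handle:
        return count_up_events(handle)
```

Running count_up_events_in_file("/var/log/cassandra/cassandra.log")
on each node and comparing the per-peer counts should show whether all
peers flap about equally or whether one node (or a group of nodes) is
over-represented.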