Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of potekhin@bnl.gov designates
 130.199.3.132 as permitted sender)
Message-ID: <4EE0204A.30904@bnl.gov>
Date: Wed, 07 Dec 2011 21:26:18 -0500
From: Maxim Potekhin <potekhin@bnl.gov>
User-Agent: Mozilla/5.0 (Windows NT 6.0;
 rv:8.0) Gecko/20111105 Thunderbird/8.0
MIME-Version: 1.0
To: user@cassandra.apache.org
Subject: Cassandra behavior too fragile?
References: 
 <CABsHg76w_x-i82EQWeBUtaQ0HYyM_A88Mn_b3ka16HLH-xJ1NQ@mail.gmail.com>
 <CAN3gsOxPPMmHPmn1D+GsbcssDZw9BFogykyTi2hOwZtTOR9vQw@mail.gmail.com>
 <CABsHg75wy7oFnO0B5mo9Y2KBeVh4-hr5UKvEBGRexBYDw-Es4w@mail.gmail.com>
In-Reply-To: 
 <CABsHg75wy7oFnO0B5mo9Y2KBeVh4-hr5UKvEBGRexBYDw-Es4w@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

OK, thanks to the excellent help of the DataStax folks, some of the more
severe inconsistencies in my Cassandra cluster were fixed (after a node
was down, compactions failed, etc.).

I'm still having problems, as reported in the "repairs 0.8.6." thread.

Thing is, why is it so easy for the repair process to break? OK, I admit
I'm not sure why nodes are reported as "dead" once in a while, but I'm
absolutely certain that they don't simply fall off the edge, get knocked
out for 10 min, or anything like that. Why is there no built-in
tolerance/retry mechanism, so that a node that seems silent for a
minute can be contacted again later or, better yet, a different node
with a relevant replica is contacted instead?
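
The kind of tolerance asked about above might look roughly like the
following sketch. This is not Cassandra code and says nothing about how
repair is actually implemented; the function name, the `request`
callable, and the retry parameters are all hypothetical, chosen only to
illustrate "try another replica, then come back later".

```python
import time


def contact_with_retry(replicas, request, attempts=3, delay=1.0, sleep=time.sleep):
    """Hypothetical sketch: try every replica, then back off and try again.

    replicas: ordered list of node identifiers holding the relevant data.
    request:  callable taking a node and raising ConnectionError if the
              node does not answer (an assumption made for this sketch).
    Returns (node, result) for the first replica that answers.
    """
    last_error = None
    for attempt in range(attempts):
        for node in replicas:
            try:
                return node, request(node)
            except ConnectionError as exc:
                # The node seems silent right now; move on to the next replica.
                last_error = exc
        if attempt < attempts - 1:
            # Every replica failed in this round; wait longer before each new round.
            sleep(delay * (2 ** attempt))
    raise ConnectionError(
        f"no replica answered after {attempts} rounds"
    ) from last_error
```

With something like this, a node that is briefly unreachable costs one
skipped attempt rather than a failed operation, and the operation only
fails once every replica has stayed silent through all rounds.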

As was evident from some presentations at Cassandra-NYC yesterday, 
failed compactions and repairs are a major problem for a number of 
users. The cluster can quickly become unusable. I think it would be a 
good idea to build more robustness into these procedures.

Regards

Maxim