cassandra-user mailing list archives

From: Jonathan Ellis <>
Subject: Re: Cluster fragility
Date: Sat, 13 Nov 2010 02:38:59 GMT
These are not expected.  In order of increasing usefulness for fixing
them, we could use:

 - INFO level logs from when something went wrong; for streaming
problems, from both the source and the target node
 - DEBUG level logs (a config sketch follows below)
 - instructions for how to reproduce
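If you want DEBUG output without drowning in it, one option (a sketch,
assuming the stock conf/log4j-server.properties layout that ships with
0.7, and that the streaming and bootstrap code lives in the packages
named below) is to override the loggers for just those code paths and
restart the node:

    # conf/log4j-server.properties
    # leave the default root logger as-is
    log4j.rootLogger=INFO,stdout,R

    # raise verbosity only for the subsystems involved here
    log4j.logger.org.apache.cassandra.streaming=DEBUG
    log4j.logger.org.apache.cassandra.dht.BootStrapper=DEBUG

That keeps the rest of the log at INFO while still giving the detail
we need from the suspect subsystems.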

On Thu, Nov 11, 2010 at 7:46 PM, Reverend Chip <> wrote:
> I've been running tests with a cluster of first four, then eight
> nodes.  I started with 0.7.0 beta3, but have since updated to a more
> recent Hudson build.  I've been happy with a lot of things, but I've
> had some surprisingly unpleasant experiences with operational
> fragility.
> For example, when adding four nodes to a four-node cluster (at 2x
> replication), I had two nodes that insisted they were streaming data,
> but the streams made no progress for over a day (this was with
> beta3).  I had to restart the cluster to clear that condition.  To
> make progress on other tests I decided just to reload the data into
> the eight-node cluster (with the more recent build), but if the data
> had not been reloadable, or the cluster had been serving production
> traffic, that would have been a very inconvenient failure.
> I also had a node that refused to bootstrap immediately, but after I
> waited a day, it finally got its act together.
> I write this not to complain per se, but to ask whether these
> failures are known and expected, and restarting a cluster is just a
> Thing You Have To Do once in a while; or, if not, what techniques can
> be used to clear such cluster topology and streaming/replication
> problems without a restart.
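For what it's worth, a less drastic first step than restarting the
whole cluster (a sketch, assuming the nodetool subcommands available
in 0.7-era builds; exact names, output, and the default JMX port may
differ in yours) is to ask the nodes on both ends of a stuck stream
what they each think is in flight:

    # 8080 is assumed here as the 0.7 default JMX port
    nodetool -h <source-node> -p 8080 streams
    nodetool -h <target-node> -p 8080 streams

    # and check how the joining node sees the ring
    nodetool -h <target-node> -p 8080 ring

If the two sides disagree, or a session shows no movement for hours,
restarting just the two endpoint nodes may be enough to clear the
session, which is much less disruptive than bouncing everything.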

Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
