cassandra-user mailing list archives

From: brian.spind...@gmail.com
Subject: Dropping down replication factor
Date: Sat, 12 Aug 2017 21:58:12 GMT
Hi folks, hopefully a quick one:

We are running a 12-node cluster (Cassandra 2.1.15) in AWS with Ec2Snitch.  It's all in one region but
spread across 3 availability zones, and it was nicely balanced with 4 nodes in each.

But after a couple of failures and subsequent replacements provisioned into the wrong AZ, we now have
a cluster with (see the quick check after the list):

5 nodes in AZ A
5 nodes in AZ B
2 nodes in AZ C
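The quick check I mean, with 'my_keyspace' standing in for our keyspace name:

  # Rack column should show the 5/5/2 spread across the three AZs
  nodetool status my_keyspace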

Not sure why, but when adding a third node in AZ C, it fails to stream after getting all the way to
completion, with no apparent error in the logs.  I've looked at a couple of bugs referring to scrubbing
and possible OOMs caused by metadata being written at the end of streaming (sorry, don't have the
tickets handy).  I'm worried I might not be able to do much with these nodes, since their disk space
usage is high and they are under a lot of load given how few of them there are for this rack.
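In case it helps, these are the commands I'm planning to run on the joining node to get more detail
next time; the log path is the packaged default and may differ on other installs:

  # Watch streaming sessions for stalls or drops during the bootstrap
  nodetool netstats

  # Look for streaming-related errors around the failure time
  grep -iE 'stream|error|exception' /var/log/cassandra/system.log | tail -n 100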

Rather than troubleshoot this further, what I was thinking about doing was (rough commands sketched just after this list):
- drop the replication factor on our keyspace to two
- hopefully this would reduce load on the two remaining nodes in AZ C
- run repairs/cleanup across the cluster
- then shoot those two nodes in the 'C' rack
- run repairs/cleanup across the cluster
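Concretely, something like the below; 'my_keyspace' and the DC name 'us-east' are placeholders, it
assumes the keyspace uses NetworkTopologyStrategy, and I'm reading "shoot" as a decommission rather
than just terminating the instances:

  # 1. Drop RF on the keyspace to 2
  cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 2};"

  # 2. Repair + cleanup, one node at a time across the cluster
  nodetool repair -pr my_keyspace
  nodetool cleanup my_keyspace

  # 3. Retire the two AZ C nodes (run on each of those two nodes)
  nodetool decommission

  # 4. Repair/cleanup across the remaining nodes again
  nodetool repair -pr my_keyspace
  nodetool cleanup my_keyspace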

Would this work with minimal/no disruption? 
Should I update their "rack" beforehand or after?
What else am I not thinking about? 

My main goal atm is to get the cluster back into a clean, consistent state that lets nodes bootstrap
properly.

Thanks for your help in advance.