We've been seeing this warning on one of our clusters:
2015-10-18 14:28:52,898 WARN [ValidationExecutor:14] org.apache.cassandra.db.context.CounterContext invalid global counter shard detected; (4aa69016-4cf8-4585-8f23-e59af050d174, 1, 67158) and (4aa69016-4cf8-4585-8f23-e59af050d174, 1, 21486) differ only in count; will pick highest to self-heal on compaction
From what I've read and heard in the IRC channel, this warning could be related to not running upgradesstables after upgrading from 2.0.x to 2.1.x. I don't think we ran it at the time of that upgrade, but we've been on 2.1 since last November. Looking back through the logs, the warnings first appear around June, at a point when no maintenance was being performed on the cluster and we had already been on 2.1.3 for a couple of months. We've been on 2.1.10 for the last week (that upgrade is when we noticed the warning for the first time).
Following a suggestion in IRC, I went ahead and ran upgradesstables on all the nodes. Our weekly repair also ran this morning. The warnings still show up throughout the day, though.
So, we have many questions:
- How much should we be freaking out?
- Why is this recurring? If I understand what's happening, this is a self-healing process. So, why would it keep happening? Are we possibly using counters incorrectly?
- What does it even mean that there were multiple shards for the same counter? How does that situation even occur?
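To make the last two questions concrete, here is a minimal sketch of how I read the log message. It assumes each shard is a (counter id, clock, count) triple, that the shard with the higher clock normally wins, and that two shards with the same id and clock are expected to be identical. The `Shard` type and `merge_shards` function are hypothetical names for illustration, not Cassandra's actual code.

```python
from typing import NamedTuple


class Shard(NamedTuple):
    """Hypothetical model of one counter shard as printed in the warning."""
    counter_id: str
    clock: int
    count: int


def merge_shards(a: Shard, b: Shard) -> Shard:
    """Pick a winner between two shards of the same counter id.

    Assumption: the higher clock wins. If the clocks are equal but the
    counts differ (the "invalid" case from the warning), keep the higher
    count, which is what "will pick highest to self-heal" seems to say.
    """
    if a.counter_id != b.counter_id:
        raise ValueError("shards belong to different counter ids")
    if a.clock != b.clock:
        return a if a.clock > b.clock else b
    # Same id and same clock: the shards should match. If they do not,
    # fall back to the higher count.
    return a if a.count >= b.count else b


# The two shards from the warning above.
left = Shard("4aa69016-4cf8-4585-8f23-e59af050d174", 1, 67158)
right = Shard("4aa69016-4cf8-4585-8f23-e59af050d174", 1, 21486)
winner = merge_shards(left, right)
print(winner.count)
```

If that reading is right, the part I can't explain is how two shards ever end up with the same id and clock but different counts.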
We're pretty lost here, so any help would be greatly appreciated.