I've noticed that one of my systems is getting hammered, and that more and more traffic is being sent to the system that is having trouble.  Looking at LeastLoadedNodeSelector.java, I can see why.

LeastLoadedNodeSelector finds the node in the cluster that is least loaded, but its calculation of "least loaded" is based on the number of active connections and ignores failures.  A node that failed on a previous attempt holds few active connections, so it looks lightly loaded and tends to receive even more connections.

Here is the code for the compare function that sorts the list of nodes.  It compares active count, then borrowed count, and lastly corrupted count.  Corrupted count is the interesting one, but it's almost never reached, since the borrowed count will almost always differ between the nodes in the cluster.

        public int compareTo(Candidate candidate) {
            int value = numActive - candidate.numActive;

            if (value == 0)
                value = numBorrowed - candidate.numBorrowed;

            if (value == 0)
                value = numCorrupted - candidate.numCorrupted;

            return value;
        }

I've seen this problem with other companies and products: least-loaded as a means of picking servers is almost always prone to death spirals once a server can fail, because the failing server sheds its connections quickly and so keeps looking like the least loaded one.
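
To make the point concrete, here is a minimal sketch of a comparison that looks at failures first.  This is not the library's code: FailureAwareCandidate, leastLoaded and the sample numbers are hypothetical names and values I made up for illustration, and only the three counters come from the compareTo quoted above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the selector's Candidate class (an assumption, not the real class).
class FailureAwareCandidate implements Comparable<FailureAwareCandidate> {
    final String name;
    final int numActive;
    final int numBorrowed;
    final int numCorrupted;

    FailureAwareCandidate(String name, int numActive, int numBorrowed, int numCorrupted) {
        this.name = name;
        this.numActive = numActive;
        this.numBorrowed = numBorrowed;
        this.numCorrupted = numCorrupted;
    }

    @Override
    public int compareTo(FailureAwareCandidate other) {
        // Failures first: the node with fewer corrupted connections sorts ahead.
        int value = Integer.compare(numCorrupted, other.numCorrupted);

        // Only then fall back to the load-based ordering.
        if (value == 0)
            value = Integer.compare(numActive, other.numActive);

        if (value == 0)
            value = Integer.compare(numBorrowed, other.numBorrowed);

        return value;
    }

    // Picks the smallest candidate under compareTo.
    static FailureAwareCandidate leastLoaded(List<FailureAwareCandidate> nodes) {
        return Collections.min(nodes);
    }
}

class FailureAwareDemo {
    public static void main(String[] args) {
        List<FailureAwareCandidate> nodes = new ArrayList<>();
        // The failing node has fewer active connections, so active-count-first
        // ordering would prefer it.
        nodes.add(new FailureAwareCandidate("healthy", 5, 100, 0));
        nodes.add(new FailureAwareCandidate("failing", 1, 20, 7));

        System.out.println(FailureAwareCandidate.leastLoaded(nodes).name); // prints "healthy"
    }
}
```

With this ordering a node that has recorded failures would be avoided for as long as its corrupted count stays higher than the others, so something along these lines would presumably also need a way to decay or reset that count.  I haven't checked whether the existing code does that.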

Is there any way to configure away from this in C*?


Brian Tarbox