From Mina Naguib <>
Subject Peculiar imbalance affecting 2 machines in a 6 node cluster
Date Tue, 09 Aug 2011 23:24:16 GMT
Hi everyone

I'm observing a very peculiar type of imbalance and I'd appreciate any help or ideas to try.
This is on Cassandra 0.7.8.

The original cluster was 3 machines in the DCMTL data center, equally balanced at 33.33% each
and each holding roughly 34G.

Then, I added to it 3 machines in the LA data center.  The ring is currently as follows (IP
addresses redacted for clarity):

Address         Status State   Load            Owns    Token                             
IPLA1           Up     Normal  34.57 GB        11.11%  0                                 
IPMTL1          Up     Normal  34.43 GB        22.22%  37809151880104273718152734159085356828
IPLA2           Up     Normal  17.55 GB        11.11%  56713727820156410577229101238628035242
IPMTL2          Up     Normal  34.56 GB        22.22%  94522879700260684295381835397713392071
IPLA3           Up     Normal  51.37 GB        11.11%  113427455640312821154458202477256070485
IPMTL3          Up     Normal  34.71 GB        22.22%  151236607520417094872610936636341427313

The bump in the 3 MTL nodes (22.22%) is in anticipation of 3 more machines in yet another
data center, but they're not ready yet to join the cluster.  Once that third DC joins all
nodes will be at 11.11%. However, I don't think this is related.
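
As a sanity check on the Owns column, here is a minimal Python sketch that recomputes each node's share from the tokens listed above. It assumes RandomPartitioner (token space 0 to 2**127), which the post does not state explicitly; the function name `ownership` is only illustrative.

```python
# Assumption: RandomPartitioner, whose token space is 0 .. 2**127.
RING_SIZE = 2**127

# (node, token) pairs copied from the ring output above.
ring = [
    ("IPLA1", 0),
    ("IPMTL1", 37809151880104273718152734159085356828),
    ("IPLA2", 56713727820156410577229101238628035242),
    ("IPMTL2", 94522879700260684295381835397713392071),
    ("IPLA3", 113427455640312821154458202477256070485),
    ("IPMTL3", 151236607520417094872610936636341427313),
]


def ownership(ring):
    """Return the percentage of the token space each node owns as primary."""
    ordered = sorted(ring, key=lambda pair: pair[1])
    owns = {}
    for i, (node, token) in enumerate(ordered):
        prev_token = ordered[i - 1][1]  # index -1 wraps around to the last node
        width = (token - prev_token) % RING_SIZE  # size of (prev_token, token]
        owns[node] = round(100 * width / RING_SIZE, 2)
    return owns


owns = ownership(ring)
for node, pct in owns.items():
    print(f"{node:8s} {pct:.2f}%")
```

Under that assumption the tokens sit at multiples of one ninth of the ring, which is where the alternating 11.11% and 22.22% figures come from.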

The problem I'm currently observing is visible in the LA machines, specifically IPLA2 and
IPLA3.  IPLA2 has 50% of the expected volume, and IPLA3 has 150% of the expected volume.

Putting their load side by side shows the peculiar ratio of 2:1:3 between the 3 LA nodes:
34.57 17.55 51.37
(the same 2:1:3 ratio is reflected in our internal tools trending reads/second and writes/second)
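
The arithmetic behind those figures can be sketched in a few lines of Python. "Expected" is taken here to mean an equal split of the total LA load across the three LA nodes, which is an assumption about what the post means by expected volume.

```python
# Load per LA node in GB, copied from the ring output.
loads_gb = {"IPLA1": 34.57, "IPLA2": 17.55, "IPLA3": 51.37}

# Assumption: the expected load is an equal third of the LA total.
expected = sum(loads_gb.values()) / len(loads_gb)
smallest = min(loads_gb.values())

for node, load in loads_gb.items():
    print(f"{node}: {load / expected:.0%} of expected, "
          f"{load / smallest:.2f}x the smallest node")

# Loads normalized to the smallest node and rounded to whole numbers.
ratio = [round(load / smallest) for load in loads_gb.values()]
print(ratio)
```

The rounded ratio comes out as 2:1:3, and IPLA2 and IPLA3 land at roughly 51% and 149% of an equal share.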

I've tried several iterations of compactions/cleanups to no avail.  In terms of config this
is the main keyspace:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
    Options: [DCMTL:2, DCLA:2]
And this is the file (IPs again redacted for clarity):
  # default for unknown nodes

One thing that did occur to me while reading the source code for the NetworkTopologyStrategy's
calculateNaturalEndpoints is that it prefers placing data on different racks.  Since all my
machines are defined as in the same rack, I believe that the 2-pass approach would still yield
balanced placement.
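
To make that expectation concrete, here is a simplified Python model of a two-pass, rack-aware replica pick. It is not the Cassandra source: it assumes the walk happens over a ring filtered down to one data center, that the token space is 0 to 2**127, and that all nodes share a placeholder rack name `RAC1`. The helper names `replicas_in_dc` and `dc_shares` are hypothetical.

```python
from bisect import bisect_left

RING_SIZE = 2**127  # assumption: RandomPartitioner token space

# (token, node, dc, rack). Every node shares one rack, as in the post.
# "RAC1" is a placeholder, not a value from the real topology file.
NODES = [
    (0, "IPLA1", "DCLA", "RAC1"),
    (37809151880104273718152734159085356828, "IPMTL1", "DCMTL", "RAC1"),
    (56713727820156410577229101238628035242, "IPLA2", "DCLA", "RAC1"),
    (94522879700260684295381835397713392071, "IPMTL2", "DCMTL", "RAC1"),
    (113427455640312821154458202477256070485, "IPLA3", "DCLA", "RAC1"),
    (151236607520417094872610936636341427313, "IPMTL3", "DCMTL", "RAC1"),
]


def replicas_in_dc(search_token, dc, rf, nodes):
    """Pick rf replicas in one DC: unique racks first, then any rack."""
    dc_nodes = sorted(n for n in nodes if n[2] == dc)
    tokens = [n[0] for n in dc_nodes]
    # First DC node whose token is >= search_token, wrapping past the end.
    start = bisect_left(tokens, search_token) % len(dc_nodes)
    walk = dc_nodes[start:] + dc_nodes[:start]
    chosen, racks = [], set()
    # Pass 1: take at most one node per rack.
    for _, node, _, rack in walk:
        if len(chosen) < rf and rack not in racks:
            chosen.append(node)
            racks.add(rack)
    # Pass 2: fill any remaining slots from racks already used.
    for _, node, _, _ in walk:
        if len(chosen) < rf and node not in chosen:
            chosen.append(node)
    return chosen


def dc_shares(dc, rf, nodes):
    """Fraction of the whole ring that each node in dc ends up storing."""
    ordered = sorted(nodes)
    shares = {n[1]: 0.0 for n in nodes if n[2] == dc}
    for i, (token, _, _, _) in enumerate(ordered):
        width = (token - ordered[i - 1][0]) % RING_SIZE  # range ending at token
        for node in replicas_in_dc(token, dc, rf, nodes):
            shares[node] += width / RING_SIZE
    return shares


la_shares = dc_shares("DCLA", 2, NODES)
for node, share in la_shares.items():
    print(f"{node}: {share:.4f}")
```

In this model each LA node stores about two thirds of the data, so the model is consistent with the balanced placement expected above and does not reproduce the observed 2:1:3 skew.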

However, just to test, I modified the topology file live to specify that IPLA1, IPLA2 and
IPLA3 are in 3 different racks, and sure enough I immediately saw the reads/second and
writes/second equalize to the expected fair volume (I quickly reverted that change).

So, it seems somehow related to rack awareness, but I've been racking my brain and I can't figure
out how/why, or why the three MTL machines are not affected the same way.

If the solution is to specify them in different racks and run repair on everything, I'm okay
with that - but I hate doing that without first understanding *why* the current behavior is
the way it is.

Any ideas would be hugely appreciated.

Thank you.

