From: Reverend Chip
Subject: loadbalance kills gossip?
Date: Sat, 06 Nov 2010 23:55:47 GMT
More weirdness with my four-or-five-node cluster of 0.7 beta3.  Having 
brought up all five nodes, including the one that didn't loadbalance
right, I tried loadbalancing it again.  (This is under completely idle
conditions - no external reads or writes.)  The result is a cluster
where each node thinks it's the only one that's Up.  (Or so they all
report when queried with "nodetool ring".)  It's been 15 minutes and no
nodes will talk to any other nodes, at all.  I already tried restarting
them all, and it's happened again.

Some more detail: This is a four-node cluster where hosts X.22 through
X.19 have been up and running, accepting a lot of data, over several
days.  Their loads are all about 500GB now.  (Actually their data disks
are more than 50% full, which is why I'm trying to add four more nodes,
one at a time.)  I brought up X.18, which correctly gave itself a good
token, but didn't stream itself any data.  So I figured I'd kick off the
streaming process with a "loadbalance" command.  I ran
    nodetool -h X.18 loadbalance
which kind of worked; it got as far as 'waiting 90s for load
information' in its log.  But this operation also seems to have stopped
gossip altogether: "nodetool ring" indicated that only X.22 was up, and X.21, X.20,
and X.19 had taken themselves down (or at least out of gossip).  When I
looked at the log for X.21, I found the below.  The first line looks
normal; host X.18 had taken itself down for loadbalancing, after all. 
But then this node also decided that everyone else was dead.  (As did
all the other nodes, about all the others.)

 INFO [GossipStage:1] 2010-11-06 16:39:45,109
(line 151) Deleting any stored hints for /X.18
 INFO [GossipStage:1] 2010-11-06 16:39:45,116
(line 631) switching in a fresh Memtable for HintsColumnFamily at
 INFO [GossipStage:1] 2010-11-06 16:39:45,117
(line 930) Enqueuing flush of Memtable-HintsColumnFamily@1878733456(0
bytes, 0 operations)
 INFO [FlushWriter:1] 2010-11-06 16:39:45,118 (line 154)
Writing Memtable-HintsColumnFamily@1878733456(0 bytes, 0 operations)
 INFO [ScheduledTasks:1] 2010-11-06 16:39:46,080 (line
133) GC for ParNew: 320 ms, 151307960 reclaimed leaving 9401921352 used;
max is 34557919232
 INFO [FlushWriter:1] 2010-11-06 16:39:46,242 (line 161)
Completed flushing
 INFO [ScheduledTasks:1] 2010-11-06 16:39:53,921 (line
195) InetAddress /X.22 is now dead.
 INFO [ScheduledTasks:1] 2010-11-06 16:39:54,922 (line
195) InetAddress /X.20 is now dead.
 INFO [ScheduledTasks:1] 2010-11-06 16:39:55,924 (line
195) InetAddress /X.19 is now dead.
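
For anyone comparing logs across nodes, here is a rough Python sketch that pulls the "is now dead" events out of log text shaped like the excerpt above, so the timing can be lined up node by node.  It assumes only the timestamp and message format visible in this excerpt; the names dead_events and DEAD_RE are made up for illustration and are not part of Cassandra or nodetool.

```python
import re
from datetime import datetime

# Assumption: entries look like the excerpt above, i.e. a "[Thread]" tag,
# a "YYYY-MM-DD HH:MM:SS,mmm" timestamp, then the message, possibly wrapped
# onto the next line.  The middle pattern refuses to cross a "[" so that a
# timestamp is never paired with a message from a later entry.
DEAD_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})"  # timestamp, millis
    r"[^\[]*?"                                        # rest of the same entry
    r"InetAddress /(\S+) is now dead"                 # host marked dead
)

def dead_events(log_text):
    """Return a list of (timestamp, host) for each 'is now dead' entry."""
    events = []
    for m in DEAD_RE.finditer(log_text):
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        ts = ts.replace(microsecond=int(m.group(2)) * 1000)
        events.append((ts, m.group(3)))
    return events

if __name__ == "__main__":
    import sys
    # Usage: python dead_events.py < system.log
    for ts, host in dead_events(sys.stdin.read()):
        print(ts.isoformat(), host)
```

Run against the excerpt above, this should list X.22, X.20, and X.19 in that order, about a second apart, which is what makes it look like one periodic task marking each peer down in turn.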

