incubator-cassandra-user mailing list archives

From Chris Burroughs <>
Subject gossip settling and bootstrap problems
Date Tue, 08 Oct 2013 00:45:24 GMT
I've been running into a variety of tricky-to-diagnose problems recently 
that could be summarized as "bootstrap & related tasks fail without 
extra hacky sleep time".

This is a sample edited log file for bootstrapping a node that captures 
the general dynamics:  This build has been 
modified (from 1.2.10) to sleep 4*RING_DELAY in 
StorageService.bootstrap().  A few notes:
  * At 30s nodes are still flapping UP and DOWN
  * handshaking is still going strong at 90s
  * Things do stabilize; they don't flap indefinitely
  * Bootstrap succeeds once it starts.  In this particular cluster a 
default RING_DELAY/build (30s) fails every time.
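
To make the "wait until things stabilize" idea concrete, here is a rough 
sketch of an alternative to a fixed sleep: poll the set of live endpoints 
and proceed only once it has stopped changing for several polls in a row. 
This is not the modified build described above and not Cassandra code; 
the class GossipSettleWaiter, the method waitUntilSettled, and the 
Supplier hook for reading live endpoints are all hypothetical names 
chosen for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of Cassandra): waits until the observed set
 * of live endpoints has stayed unchanged for several consecutive polls.
 */
public final class GossipSettleWaiter {
    private GossipSettleWaiter() {}

    /**
     * @param liveEndpoints  assumed hook returning the current live-endpoint view
     * @param pollIntervalMs delay between polls, in milliseconds
     * @param requiredStable consecutive unchanged polls needed to count as settled
     * @param maxPolls       upper bound on polls before giving up
     * @return true if the view settled; false if maxPolls ran out or the
     *         thread was interrupted while waiting
     */
    public static boolean waitUntilSettled(Supplier<Set<String>> liveEndpoints,
                                           long pollIntervalMs,
                                           int requiredStable,
                                           int maxPolls) {
        // Copy each view so later mutation by the caller cannot affect comparisons.
        Set<String> previous = new HashSet<>(liveEndpoints.get());
        int stable = 0;
        for (int poll = 0; poll < maxPolls; poll++) {
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
            Set<String> current = new HashSet<>(liveEndpoints.get());
            if (current.equals(previous)) {
                stable++;
                if (stable >= requiredStable) {
                    return true;
                }
            } else {
                // A node flapped UP or DOWN: restart the stability count.
                stable = 0;
                previous = current;
            }
        }
        return false;
    }
}
```

A caller would pass something that reads the node's current view of live 
members, plus a poll interval and stability threshold sized for the 
cluster. Unlike a fixed multiple of RING_DELAY, the wait ends as soon as 
the view is quiet, and the maxPolls bound keeps it from hanging if the 
view never stops changing.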

Ping times, TCP retransmit, and other general network stuff look fine. 
There are several different tickets (some from me) that reference what 
seemed to me to be possibly similar or at least correlated issues:
  * CASSANDRA-4288 : prevent thrift server from starting before gossip 
has settled
  * CASSANDRA-5815 : NPE from migration manager
  * CASSANDRA-5915 : node flapping prevents replace_node from succeeding 
  * CASSANDRA-6156 : Poor resilience and recovery for bootstrapping node 
- "unable to fetch range"
  * CASSANDRA-6127 : vnodes don't scale to hundreds of nodes

I suspect that a combination of factors is causing gossip to take longer 
to stabilize:
  * vnodes
  * (cross country or greater) multi-dc
  * bigger than a test cluster (> 50 nodes)
  * reconnecting snitch

What are other people seeing in their clusters?  Does anyone routinely 
change RING_DELAY (Google finds precious few references)?
