incubator-cassandra-user mailing list archives

From Keith Wright <kwri...@nanigans.com>
Subject Re: Unable to bootstrap new node
Date Mon, 07 Oct 2013 13:17:35 GMT
Hi all,

  We are still having issues bootstrapping nodes, and this is becoming quite a blocker for
us.  We are seeing the same behavior where bootstrapping the node causes one or more existing
nodes to hang in GC (see attached screenshot).  Increasing the heap and new size has not
helped, nor has increasing phi to 12.  The email below gives more history.  Any ideas would
be VERY welcome!
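
For reference, these are the knobs we have been changing (file locations assume a stock
install; the heap values are what we tried, and the exact new-gen size shown is illustrative):

    # cassandra-env.sh -- heap sizing (we went from 10 GB to 14 GB)
    MAX_HEAP_SIZE="14G"
    HEAP_NEWSIZE="2G"            # illustrative new-gen size

    # cassandra.yaml -- make the failure detector more tolerant of GC pauses
    phi_convict_threshold: 12    # default is 8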

Thanks

From: Keith Wright <kwright@nanigans.com>
Date: Thursday, October 3, 2013 10:14 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Cc: Don Jackson <djackson@nanigans.com>
Subject: Re: Unable to bootstrap new node

Thanks for the response.   We are still having issues bootstrapping a node.  Quick background
on where we stand (1.2.8 with Vnodes):

 *   We had a node start to complain about corrupted SSTables, which we tried to delete one
by one, but it quickly became a whack-a-mole problem, so we decided we would just wipe the
node and bootstrap it
 *   We shut down that node and ran nodetool removenode from another node
 *   We wiped the affected node's data and then attempted to bootstrap it (with the same IP
of course); the rough command sequence is sketched after this list
 *   Every time we attempt to add the node, 2 of the 4 nodes sending data (the same 2 nodes
each time) have streaming failures, which I believe are caused by GC (see logging below).
The streaming from these two nodes fails within the first couple of minutes of bootstrapping
the node.
 *   We tried restarting the nodes that failed to stream, but the bootstrapping node did not
automatically re-attempt the streaming, and again we couldn't find a way to force it to
 *   We have tried upping the heap and new size on the nodes to help reduce the GC pressure
(from our original 10 GB to 14 GB) but no luck, and we also decreased
stream_throughput_outbound_megabits_per_sec from 400 to 200
 *   Eventually the bootstrapping node just hangs, as it never gets data from the 2 nodes,
and there is no way I can find to get it to re-attempt
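
For completeness, the rough command sequence we used (data paths assume the default
/var/lib/cassandra layout; <host-id> stands for the dead node's Host ID as shown by
nodetool status):

    # from a live node: remove the dead node from the ring
    nodetool removenode <host-id>

    # on the wiped node, with Cassandra stopped: clear all local state
    rm -rf /var/lib/cassandra/data/* \
           /var/lib/cassandra/commitlog/* \
           /var/lib/cassandra/saved_caches/*

    # restart Cassandra; with auto_bootstrap at its default (true) the node
    # re-joins the ring and bootstrap streaming begins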

I'm at a bit of a loss.  Honestly, bootstrapping nodes has been a total nightmare for me and
makes me very concerned about our ability to fix/grow our cluster as needed.  I hoped Vnodes
would help, but so far no luck.  Here are the options as I see them:

 *   Hope someone here has a great idea on how to fix it :)
 *   Assuming I can't get the node to bootstrap, I can start it with bootstrap disabled and
trigger a repair (a rough sketch of this follows below).  Is there any way to ensure it
doesn't serve any reads during that time?  I can disable the thrift/binary ports, but it
will still handle requests from other coordinator nodes.  We usually run at read ANY, so to
ensure we don't miss data we would need to run at QUORUM until the repair completes.
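
If I end up going the no-bootstrap route, this is roughly what I have in mind (assuming
nodetool's disablethrift/disablebinary behave as expected on 1.2):

    # cassandra.yaml on the replaced node: join the ring without streaming
    auto_bootstrap: false

    # after startup, stop serving clients directly (though as noted above,
    # replica reads routed from other coordinators will still reach this node)
    nodetool disablethrift
    nodetool disablebinary

    # rebuild this node's data from its replicas
    nodetool repair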

Thanks for the help!

Existing streaming node 1 (10.8.44.98):
ERROR [GossipTasks:1] 2013-10-03 13:09:28,654 AbstractStreamSession.java (line 110) Stream
failed because /10.8.44.84 died or was restarted/removed (streams may still be active in background,
but further streams won't be started)
ERROR [GossipTasks:1] 2013-10-03 13:09:28,720 AbstractStreamSession.java (line 110) Stream
failed because /10.8.44.84 died or was restarted/removed (streams may still be active in background,
but further streams won't be started)

Existing streaming node 2 (10.8.44.72):
ERROR [GossipTasks:1] 2013-10-03 13:10:02,174 AbstractStreamSession.java (line 110) Stream
failed because /10.8.44.84 died or was restarted/removed (streams may still be active in background,
but further streams won't be started)
ERROR [GossipTasks:1] 2013-10-03 13:10:02,185 AbstractStreamSession.java (line 110) Stream
failed because /10.8.44.84 died or was restarted/removed (streams may still be active in background,
but further streams won't be started)
ERROR [ReplicateOnWriteStage:38] 2013-10-03 13:10:02,265 FailureDetector.java (line 154) unknown
endpoint /10.8.44.84
ERROR [ReplicateOnWriteStage:36] 2013-10-03 13:10:02,302 FailureDetector.java (line 154) unknown
endpoint /10.8.44.84
ERROR [Native-Transport-Requests:151] 2013-10-03 13:10:02,282 FailureDetector.java (line 154)
unknown endpoint /10.8.44.84
ERROR [ReplicateOnWriteStage:37] 2013-10-03 13:10:02,318 FailureDetector.java (line 154) unknown
endpoint /10.8.44.84

Bootstrapping node (10.8.44.84):
ERROR [GossipTasks:1] 2013-10-03 13:09:23,196 AbstractStreamSession.java (line 110) Stream
failed because /10.8.44.98 died or was restarted/removed (streams may still be active in background,
but further streams won't be started)
ERROR [GossipTasks:1] 2013-10-03 13:09:24,199 AbstractStreamSession.java (line 110) Stream
failed because /10.8.44.72 died or was restarted/removed (streams may still be active in background,
but further streams won't be started)


From: Robert Coli <rcoli@eventbrite.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, October 2, 2013 1:55 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Unable to bootstrap new node

On Wed, Oct 2, 2013 at 8:12 AM, Keith Wright <kwright@nanigans.com> wrote:
   We are running C* 1.2.8 with Vnodes enabled and are attempting to bootstrap a new node,
and we are having issues.  When we add the node, we see it bootstrap and we see data start
to stream over from other nodes; however, one of the other nodes gets stuck in full GCs to
the point where we had to restart it.  I assume this is because building the Merkle tree is
expensive.

Merkle trees are only involved in "repair", not in normal bootstrap. Have you considered
lowering the throttle for streaming? Bootstrap will be slower, but should be less likely to
overwhelm the heap.
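
Concretely (the value below is just an example; setstreamthroughput takes megabits per
second and takes effect without a restart):

    # persistent: cassandra.yaml on the nodes sending streams
    stream_throughput_outbound_megabits_per_sec: 100

    # or live, per node, no restart required
    nodetool setstreamthroughput 100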

Any way to force the streaming to restart?   Have others seen this?

In the bootstrap case, you can just wipe the bootstrapping node and re-start the bootstrap.

In the general case regarding hung streaming:

https://issues.apache.org/jira/browse/CASSANDRA-3486

The only solution to hung non-bootstrap streaming is to restart all nodes participating in
the streaming. With vnodes, this will probably approach 100% of nodes...

=Rob
