cassandra-commits mailing list archives

From "Joel Knighton (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10844) failed_bootstrap_wiped_node_can_join_test is failing
Date Wed, 30 Dec 2015 22:09:49 GMT


Joel Knighton commented on CASSANDRA-10844:

In [CASSANDRA-7069], we added consistent range movement to prevent concurrent bootstraps/decommissions.

We do this in the {{checkForEndpointCollision}} method.

In [CASSANDRA-7939], we observed that this prevented immediately retrying a failed bootstrap.
To avoid this, we switched to checking whether a node is a fat client. In that situation we
skip {{isSafeForBootstrap}} and instead iterate over all endpoint states, checking whether
any endpoint is in STATUS_LEAVING/STATUS_MOVING/STATUS_BOOTSTRAPPING.

However, this didn't solve the problem if a node had reached the point of setting this status
before failing its bootstrap.

In [CASSANDRA-8494], this deficiency was noticed while adding resumable bootstrapping, and a line
was added in [this commit|]
to ignore this gossip state for a fat client with the same broadcastAddress as the bootstrapping
node. Since resumable bootstrapping went into 2.2+ only, this explains why this test is failing
only on 2.1 (since we aren't ignoring the fat client gossip entry for our previous failed
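
To make the difference between the two behaviors concrete, here is a minimal sketch of the check as described above. It is not the actual {{StorageService}} code: the class, enum, and method names are hypothetical, gossip state is reduced to a map from address to status, and the fat-client test is left out. With {{ignoreOwnAddress}} set to false it models the 2.1 behavior, where the stale entry from our own failed bootstrap trips the check; with it set to true it models the same-broadcastAddress exemption described for 2.2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model for illustration; not Cassandra's real classes.
class EndpointCollisionSketch {
    enum Status { NORMAL, BOOTSTRAPPING, LEAVING, MOVING }

    /**
     * Returns true if any known endpoint is in the middle of a range movement.
     * When ignoreOwnAddress is true, the gossip entry whose address equals
     * ownAddress is skipped (the exemption described for 2.2).
     */
    static boolean otherRangeMovementDetected(Map<String, Status> endpointStates,
                                              String ownAddress,
                                              boolean ignoreOwnAddress) {
        for (Map.Entry<String, Status> entry : endpointStates.entrySet()) {
            if (ignoreOwnAddress && entry.getKey().equals(ownAddress)) {
                continue; // stale state left behind by our own failed bootstrap
            }
            Status status = entry.getValue();
            if (status == Status.BOOTSTRAPPING
                    || status == Status.LEAVING
                    || status == Status.MOVING) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Status> gossip = new LinkedHashMap<>();
        gossip.put("127.0.0.1", Status.NORMAL);
        // node2 set its bootstrapping status and was then killed
        gossip.put("127.0.0.2", Status.BOOTSTRAPPING);

        // Without the exemption: the restarted node2 sees its own stale entry.
        System.out.println(otherRangeMovementDetected(gossip, "127.0.0.2", false)); // true
        // With the exemption: the stale entry is ignored and the join can proceed.
        System.out.println(otherRangeMovementDetected(gossip, "127.0.0.2", true));  // false
    }
}
```

In this model, a true result corresponds to the "Other bootstrapping/leaving/moving nodes detected" failure quoted in the issue description.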

This failing test was added in [CASSANDRA-9765], which addressed deficiencies in {{checkForEndpointCollision}}.

The consensus on 9765 was that bootstrapping is a safe state when checking for endpoint collisions
(deferring to 7939).

I think the best fix here is to backport the bootstrapping broadcastAddress check from 2.2.
What do you think, [~Stefania]? Do you recall seeing a different behavior for this test on
> failed_bootstrap_wiped_node_can_join_test is failing
> ----------------------------------------------------
>                 Key: CASSANDRA-10844
>                 URL:
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Streaming and Messaging, Testing
>            Reporter: Philip Thompson
>             Fix For: 2.1.x
>         Attachments: node1.log, node2.log
> {{bootstrap_test.TestBootstrap.failed_bootstrap_wiped_node_can_join_test}} is failing
> on 2.1-head. The second node fails to join the cluster. I see a lot of exceptions in node1's
> log, such as
> {code}
> ERROR [STREAM-OUT-/] 2015-12-11 12:06:13,778 - [Stream #7b5ec5a0-a029-11e5-bad9-ffd0922f40e6] Streaming error occurred
> Broken pipe
>         at Method) ~[na:1.8.0_51]
>         at ~[na:1.8.0_51]
>         at ~[na:1.8.0_51]
>         at ~[na:1.8.0_51]
>         at ~[na:1.8.0_51]
>         at
>         at org.apache.cassandra.streaming.messages.StreamMessage.serialize(
>         at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(
>         at org.apache.cassandra.streaming.ConnectionHandler$
>         at [na:1.8.0_51]
> {code}
> Which seems consistent with node2 being killed, so the bootstrap fails. But then when
> restarting node2, it does not join. It *looks* like it fails to rejoin because of a false
> positive in checking the 2 minute rule.
> {code}
> ERROR [main] 2015-12-11 12:06:17,954 - Exception encountered during startup
> java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
>         at org.apache.cassandra.service.StorageService.checkForEndpointCollision( ~[main/:na]
>         at org.apache.cassandra.service.StorageService.prepareToJoin(S ~[main/:na]
>         at org.apache.cassandra.service.StorageService.initServer(Stor ~[main/:na]
>         at org.apache.cassandra.service.StorageService.initServer(Stor ~[main/:na]
>         at org.apache.cassandra.service.CassandraDaemon.setup(
>         at org.apache.cassandra.service.CassandraDaemon.activate(
>         at org.apache.cassandra.service.CassandraDaemon.main(
> {code}
> This fails consistently locally and on cassci. Logs attached.

This message was sent by Atlassian JIRA
