cassandra-commits mailing list archives

From "Joel Knighton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10844) failed_bootstrap_wiped_node_can_join_test is failing
Date Wed, 30 Dec 2015 22:09:49 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075465#comment-15075465 ]

Joel Knighton commented on CASSANDRA-10844:
-------------------------------------------

In [CASSANDRA-7069], we added consistent range movement to prevent concurrent bootstraps/decommissions.

We do this in the {{checkForEndpointCollision}} method.

In [CASSANDRA-7939], we observed that this prevented immediately retrying a failed bootstrap.
To avoid this, we switched to checking whether a node is a fat client. In that situation we
no longer check {{isSafeForBootstrap}}; instead we iterate over all endpoint states and check
whether any endpoint is in STATUS_LEAVING/STATUS_MOVING/STATUS_BOOTSTRAPPING.

However, this didn't solve the problem if a node had already set this status before failing
its bootstrap.

In [CASSANDRA-8494], this deficiency was noticed while adding resumable bootstrapping, and a line
was added in [this commit|https://github.com/yukim/cassandra/commit/5f7fd497ae83f813078d56ba1b61f7ea322e5d5a]
to ignore this gossip state for a fat client with the same broadcastAddress as the bootstrapping
node. Since resumable bootstrapping went into 2.2+ only, this explains why this test is failing
only on 2.1: there we aren't ignoring the fat client gossip entry left over from our previous
failed bootstrap.
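
As a rough illustration of the difference between the two branches, here is a simplified model of the check. It is a sketch only and not the actual {{StorageService}} code: the class {{BootstrapCheckSketch}}, the {{Status}} enum, the {{ignoreOwnAddress}} flag, and the use of plain strings for addresses are all hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of the endpoint-state scan described above.
// It does not reproduce Cassandra's real gossip types or method signatures.
final class BootstrapCheckSketch {
    enum Status { NORMAL, BOOTSTRAPPING, LEAVING, MOVING }

    // Returns true when no endpoint is in a range-movement state.
    // ignoreOwnAddress = false models the 2.1 behaviour (every entry counts);
    // ignoreOwnAddress = true models the 2.2+ behaviour (skip our own stale entry).
    static boolean safeToBootstrap(Map<String, Status> gossipStates,
                                   String ownBroadcastAddress,
                                   boolean ignoreOwnAddress) {
        for (Map.Entry<String, Status> entry : gossipStates.entrySet()) {
            if (ignoreOwnAddress && entry.getKey().equals(ownBroadcastAddress)) {
                continue; // entry left over from our own failed bootstrap
            }
            Status s = entry.getValue();
            if (s == Status.BOOTSTRAPPING || s == Status.LEAVING || s == Status.MOVING) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Status> gossip = new LinkedHashMap<>();
        gossip.put("127.0.0.1", Status.NORMAL);
        gossip.put("127.0.0.2", Status.BOOTSTRAPPING); // stale state from node2's failed attempt

        System.out.println(safeToBootstrap(gossip, "127.0.0.2", false)); // prints false
        System.out.println(safeToBootstrap(gossip, "127.0.0.2", true));  // prints true
    }
}
```

In this model, a restarted node2 is refused when its own stale BOOTSTRAPPING entry is counted, which matches the failure seen on 2.1, and is allowed once that entry is skipped.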

This failing test was added in [CASSANDRA-9765], which addressed deficiencies in {{checkForEndpointCollision}}.


The consensus on 9765 was that bootstrapping is a safe state when checking for endpoint collisions
(deferring to 7939).

I think the best fix here is to backport the bootstrapping broadcastAddress check from 2.2
- what do you think [~Stefania]? Do you recall seeing a different behavior for this test on
2.1?



> failed_bootstrap_wiped_node_can_join_test is failing
> ----------------------------------------------------
>
>                 Key: CASSANDRA-10844
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10844
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Streaming and Messaging, Testing
>            Reporter: Philip Thompson
>             Fix For: 2.1.x
>
>         Attachments: node1.log, node2.log
>
>
> {{bootstrap_test.TestBootstrap.failed_bootstrap_wiped_node_can_join_test}} is failing on 2.1-head. The second node fails to join the cluster. I see a lot of exceptions in node1's log, such as
> {code}
> ERROR [STREAM-OUT-/127.0.0.2] 2015-12-11 12:06:13,778 StreamSession.java:505 - [Stream #7b5ec5a0-a029-11e5-bad9-ffd0922f40e6] Streaming error occurred
> java.io.IOException: Broken pipe
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_51]
>         at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_51]
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_51]
>         at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_51]
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_51]
>         at org.apache.cassandra.io.util.DataOutputStreamAndChannel.write(DataOutputStreamAndChannel.java:48) ~[main/:na]
>         at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:44) ~[main/:na]
>         at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:351) [main/:na]
>         at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:331) [main/:na]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
> {code}
> These seem consistent with node2 being killed, so the bootstrap fails. But then when restarting node2, it does not join. It *looks* like it fails to rejoin because of a false positive in checking the 2 minute rule.
> {code}
> ERROR [main] 2015-12-11 12:06:17,954 CassandraDaemon.java:579 - Exception encountered during startup
> java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
>         at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:559) ~[main/:na]
>         at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:789) ~[main/:na]
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:721) ~[main/:na]
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612) ~[main/:na]
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:387) [main/:na]
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:562) [main/:na]
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:651) [main/:na]
> {code}
> This fails consistently locally and on cassci. Logs attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
