cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stefania (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12781) dtest failure in pushed_notifications_test.TestPushedNotifications.restart_node_test
Date Fri, 21 Oct 2016 04:34:58 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593950#comment-15593950
] 

Stefania commented on CASSANDRA-12781:
--------------------------------------

The healthy DOWN and UP notifications are all separated by just over 20 seconds. When we receive
the extra DOWN notification, the UP notification follows within less than a second from the
DOWN notification:

{code}
dtest: DEBUG: Got {'change_type': u'NEW_NODE', 'address': ('127.0.0.2', 9042)} from 127.0.0.1
at 2016-10-05 04:16:57.887774
dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at
2016-10-05 04:16:57.888112

dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1
at 2016-10-05 04:16:57.934182
dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at
2016-10-05 04:17:18.734717

dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1
at 2016-10-05 04:17:18.810279
dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at
2016-10-05 04:17:38.922725

dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1
at 2016-10-05 04:17:38.983538
dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at
2016-10-05 04:17:59.823012

dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1
at 2016-10-05 04:17:59.861144
dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at
2016-10-05 04:18:20.984716

dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1
at 2016-10-05 04:18:21.038949
dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at
2016-10-05 04:18:21.986869
dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1
at 2016-10-05 04:18:33.047814
{code}

So the problem is a bogus UP notification, which was caused by a GOSSIP message that caused
{{realMarkAlive()}} to be called (note
the {{is now UP}} log message, which is logged by {{realMarkAlive()}}:

{code}
INFO  [GossipStage:1] 2016-10-05 04:18:21,037 Gossiper.java:1034 - InetAddress /127.0.0.2
is now DOWN
DEBUG [GossipStage:1] 2016-10-05 04:18:21,037 OutboundTcpConnection.java:179 - Enqueuing socket
close for /127.0.0.2
DEBUG [GossipStage:1] 2016-10-05 04:18:21,037 OutboundTcpConnection.java:179 - Enqueuing socket
close for /127.0.0.2
DEBUG [GossipStage:1] 2016-10-05 04:18:21,037 OutboundTcpConnection.java:179 - Enqueuing socket
close for /127.0.0.2
DEBUG [MessagingService-Outgoing-/127.0.0.2-Small] 2016-10-05 04:18:21,038 OutboundTcpConnection.java:373
- Socket to /127.0.0.2 closed
DEBUG [GossipStage:1] 2016-10-05 04:18:21,038 FailureDetector.java:323 - Forcing conviction
of /127.0.0.2
DEBUG [MessagingService-Outgoing-/127.0.0.2-Gossip] 2016-10-05 04:18:21,039 OutboundTcpConnection.java:373
- Socket to /127.0.0.2 closed
DEBUG [MessagingService-Outgoing-/127.0.0.2-Gossip] 2016-10-05 04:18:21,039 OutboundTcpConnection.java:387
- Attempting to connect to /127.0.0.2
INFO  [HANDSHAKE-/127.0.0.2] 2016-10-05 04:18:21,045 OutboundTcpConnection.java:512 - Handshaking
version with /127.0.0.2
DEBUG [MessagingService-Outgoing-/127.0.0.2-Gossip] 2016-10-05 04:18:21,046 OutboundTcpConnection.java:481
- Done connecting to /127.0.0.2
DEBUG [RequestResponseStage-1] 2016-10-05 04:18:21,984 Gossiper.java:1018 - removing expire
time for endpoint : /127.0.0.2
INFO  [RequestResponseStage-1] 2016-10-05 04:18:21,984 Gossiper.java:1019 - InetAddress /127.0.0.2
is now UP
{code}

Since CASSANDRA-8236, {{realMarkAlive()}} won't generate an UP notification if {{RPC_READY}}
is false (see {{SS.notifyUp()}}). Therefore my guess is that the problem is that {{RPC_READY}}
is only set to false in {{CassandraDaemon.stop()}}, not when the RPC services are stopped
by {{SS.drain()}}, which is in {{CassandraDaemon.shutdownClientServers()}}.

My plan is to see how easily we can reproduce this in a multiplexed run, and if we can reproduce
it then we can test the assumption above. If we cannot reproduce it, we should still set {{RPC_READY}}
to false in {{CassandraDaemon.shutdownClientServers()}}, I think this should be safe even
for 3.0, and is the correct thing to do.

> dtest failure in pushed_notifications_test.TestPushedNotifications.restart_node_test
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-12781
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12781
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sean McCarthy
>            Assignee: Stefania
>              Labels: dtest
>             Fix For: 3.x
>
>         Attachments: node1.log, node1_debug.log, node1_gc.log, node2.log, node2_debug.log,
node2_gc.log
>
>
> example failure:
> http://cassci.datastax.com/job/cassandra-3.X_dtest/4/testReport/pushed_notifications_test/TestPushedNotifications/restart_node_test
> {code}
> Error Message
> [{'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)}, {'change_type': u'UP', 'address':
('127.0.0.2', 9042)}, {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)}]
> {code}
> {code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/pushed_notifications_test.py", line 181, in restart_node_test
>     self.assertEquals(expected_notifications, len(notifications), notifications)
>   File "/usr/lib/python2.7/unittest/case.py", line 513, in assertEqual
>     assertion_func(first, second, msg=msg)
>   File "/usr/lib/python2.7/unittest/case.py", line 506, in _baseAssertEqual
>     raise self.failureException(msg)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message