Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 68229200BB6 for ; Fri, 21 Oct 2016 06:35:00 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 66A1D160AF2; Fri, 21 Oct 2016 04:35:00 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 84CBB160AE0 for ; Fri, 21 Oct 2016 06:34:59 +0200 (CEST) Received: (qmail 85493 invoked by uid 500); 21 Oct 2016 04:34:58 -0000 Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@cassandra.apache.org Delivered-To: mailing list commits@cassandra.apache.org Received: (qmail 85481 invoked by uid 99); 21 Oct 2016 04:34:58 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 21 Oct 2016 04:34:58 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 56A992C0B04 for ; Fri, 21 Oct 2016 04:34:58 +0000 (UTC) Date: Fri, 21 Oct 2016 04:34:58 +0000 (UTC) From: "Stefania (JIRA)" To: commits@cassandra.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (CASSANDRA-12781) dtest failure in pushed_notifications_test.TestPushedNotifications.restart_node_test MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Fri, 21 Oct 2016 04:35:00 -0000 [ https://issues.apache.org/jira/browse/CASSANDRA-12781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593950#comment-15593950 ] Stefania commented on CASSANDRA-12781: -------------------------------------- The healthy DOWN and UP notifications are all separated by just over 20 seconds. When we receive the extra DOWN notification, the UP notification follows within less than a second from the DOWN notification: {code} dtest: DEBUG: Got {'change_type': u'NEW_NODE', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:16:57.887774 dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:16:57.888112 dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:16:57.934182 dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:17:18.734717 dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:17:18.810279 dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:17:38.922725 dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:17:38.983538 dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:17:59.823012 dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:17:59.861144 dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:18:20.984716 dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:18:21.038949 dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:18:21.986869 dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:18:33.047814 {code} So the problem is a bogus UP notification, which was caused by a GOSSIP message that caused {{realMarkAlive()}} to be called (note the {{is now UP}} log message, which is logged by {{realMarkAlive()}}: {code} INFO [GossipStage:1] 2016-10-05 04:18:21,037 Gossiper.java:1034 - InetAddress /127.0.0.2 is now DOWN DEBUG [GossipStage:1] 2016-10-05 04:18:21,037 OutboundTcpConnection.java:179 - Enqueuing socket close for /127.0.0.2 DEBUG [GossipStage:1] 2016-10-05 04:18:21,037 OutboundTcpConnection.java:179 - Enqueuing socket close for /127.0.0.2 DEBUG [GossipStage:1] 2016-10-05 04:18:21,037 OutboundTcpConnection.java:179 - Enqueuing socket close for /127.0.0.2 DEBUG [MessagingService-Outgoing-/127.0.0.2-Small] 2016-10-05 04:18:21,038 OutboundTcpConnection.java:373 - Socket to /127.0.0.2 closed DEBUG [GossipStage:1] 2016-10-05 04:18:21,038 FailureDetector.java:323 - Forcing conviction of /127.0.0.2 DEBUG [MessagingService-Outgoing-/127.0.0.2-Gossip] 2016-10-05 04:18:21,039 OutboundTcpConnection.java:373 - Socket to /127.0.0.2 closed DEBUG [MessagingService-Outgoing-/127.0.0.2-Gossip] 2016-10-05 04:18:21,039 OutboundTcpConnection.java:387 - Attempting to connect to /127.0.0.2 INFO [HANDSHAKE-/127.0.0.2] 2016-10-05 04:18:21,045 OutboundTcpConnection.java:512 - Handshaking version with /127.0.0.2 DEBUG [MessagingService-Outgoing-/127.0.0.2-Gossip] 2016-10-05 04:18:21,046 OutboundTcpConnection.java:481 - Done connecting to /127.0.0.2 DEBUG [RequestResponseStage-1] 2016-10-05 04:18:21,984 Gossiper.java:1018 - removing expire time for endpoint : /127.0.0.2 INFO [RequestResponseStage-1] 2016-10-05 04:18:21,984 Gossiper.java:1019 - InetAddress /127.0.0.2 is now UP {code} Since CASSANDRA-8236, {{realMarkAlive()}} won't generate an UP notification if {{RPC_READY}} is false (see {{SS.notifyUp()}}). Therefore my guess is that the problem is that {{RPC_READY}} is only set to false in {{CassandraDaemon.stop()}}, not when the RPC services are stopped by {{SS.drain()}}, which is in {{CassandraDaemon.shutdownClientServers()}}. My plan is to see how easily we can reproduce this in a multiplexed run, and if we can reproduce it then we can test the assumption above. If we cannot reproduce it, we should still set {{RPC_READY}} to false in {{CassandraDaemon.shutdownClientServers()}}, I think this should be safe even for 3.0, and is the correct thing to do. > dtest failure in pushed_notifications_test.TestPushedNotifications.restart_node_test > ------------------------------------------------------------------------------------ > > Key: CASSANDRA-12781 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12781 > Project: Cassandra > Issue Type: Bug > Reporter: Sean McCarthy > Assignee: Stefania > Labels: dtest > Fix For: 3.x > > Attachments: node1.log, node1_debug.log, node1_gc.log, node2.log, node2_debug.log, node2_gc.log > > > example failure: > http://cassci.datastax.com/job/cassandra-3.X_dtest/4/testReport/pushed_notifications_test/TestPushedNotifications/restart_node_test > {code} > Error Message > [{'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)}, {'change_type': u'UP', 'address': ('127.0.0.2', 9042)}, {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)}] > {code} > {code} > Stacktrace > File "/usr/lib/python2.7/unittest/case.py", line 329, in run > testMethod() > File "/home/automaton/cassandra-dtest/pushed_notifications_test.py", line 181, in restart_node_test > self.assertEquals(expected_notifications, len(notifications), notifications) > File "/usr/lib/python2.7/unittest/case.py", line 513, in assertEqual > assertion_func(first, second, msg=msg) > File "/usr/lib/python2.7/unittest/case.py", line 506, in _baseAssertEqual > raise self.failureException(msg) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)