Date: Fri, 26 Jan 2018 00:19:00 +0000 (UTC)
From: "Jason Brown (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

[ https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340340#comment-16340340 ]

Jason Brown commented on CASSANDRA-14155:
-----------------------------------------

*WHAT IS HAPPENING?*

So, the obvious part is that we aren't finding the {{HOST_ID}} in the endpoint's state, but where is that data coming from? With CASSANDRA-10134 (in c* 3.6), we began [performing a shadow round of gossip|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L507] on every bounce of a node. The shadow round data comes from any peer in the node's seed list.
To hit the NPE state, the shadow round data provided by the seed must contain an entry in the Map for the node's {{InetAddress}}, but must not contain the {{HOST_ID}} and, as I suspect, no {{ApplicationStates}}; see next section. (Note that CASSANDRA-12653, committed to 3.11, moved the collected shadow round state from {{Gossiper#endpointStateMap}} to {{Gossiper#endpointShadowStateMap}}. However, I do not believe that will affect the observed behavior here.)

*HOW ARE WE GETTING INTO THIS STATE?*

Barring some kind of Byzantine failure, my best guess is this: assume three nodes, A-B-C, and C is the node that hits the NPE. C contacts its seed nodes (in this example, at a minimum B), and the response from B is the first one processed. Given the explanation above of how C processes B's shadow round data, I think B itself has just left its own shadow round (by getting a response back to its own shadow round, which presumably comes from A in this example). Then, on B:

- In {{StorageService#prepareToJoin()}}, we [{{loadRingState()}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L788]. This will insert into B's {{Gossiper#endpointStateMap}} all of the peers (via {{InetAddress}}) that we knew about before the bounce. NOTE: we do not add in any previously known {{ApplicationState}}s. Thus, {{Gossiper#endpointStateMap}} contains {{InetAddress}}es which point to 'empty' {{EndpointState}}s (no populated {{ApplicationState}}s).
- We then start the [{{Gossiper}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L792], after which we will start processing any incoming gossip message.
- If the first incoming gossip message is a SYN from C, we will happily send back everything we know about the cluster.
In the case of B, which has just bounced, it basically only knows the {{InetAddress}}es of peers, with no {{ApplicationStates}}.

Then, C gets back the (more or less) empty gossip data from B, and because it ["sees" its own address in that shadowRoundData|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/gms/Gossiper.java#L766], it assumes it should also see metadata ({{ApplicationState}}s) about itself. That's when it looks up the {{HOST_ID}} and [naively tries to dereference it|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/gms/Gossiper.java#L788], causing the NPE in this case.

*SOLUTION:*

I don't think we can necessarily change the distributed race on restart without larger structural changes, but we can change how a node determines if it can exit the shadow round. As we're basically only checking for a previous {{HOST_ID}} for the current node in the shadow round data, I propose we add a check to {{Gossiper#maybeFinishShadowRound()}} that, in addition to the existing checks, looks at whether the data contains the {{HOST_ID}} for the current node. If so, exit the shadow round as usual; else, keep waiting for a more complete set of gossip data.
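
To make the failure and the proposed check concrete, here is a minimal, self-contained sketch. It is not the code from the patch: {{ShadowRoundSketch}}, {{EndpointStateSketch}}, {{AppState}}, {{naiveHostId}} and {{canFinishShadowRound}} are hypothetical stand-ins invented for illustration, addresses are plain strings, and the real {{Gossiper}} performs other checks that are omitted here. How a node with no entry at all in the shadow round data should be treated is not covered by the description above; the sketch simply reports "keep waiting" for that case.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch; none of these names come from Cassandra itself.
final class ShadowRoundSketch {
    // Stand-in for the application state keys gossiped about an endpoint.
    enum AppState { HOST_ID, STATUS }

    // Stand-in for an endpoint's state: may be "empty" (no states populated).
    static final class EndpointStateSketch {
        private final Map<AppState, String> states = new EnumMap<>(AppState.class);

        void put(AppState key, String value) { states.put(key, value); }

        // Returns null when the state was never populated.
        String get(AppState key) { return states.get(key); }
    }

    // Naive lookup: assumes that an entry for our address implies a HOST_ID.
    // Throws NullPointerException when the entry exists but is empty.
    static UUID naiveHostId(Map<String, EndpointStateSketch> shadowData, String self) {
        String raw = shadowData.get(self).get(AppState.HOST_ID);
        return UUID.fromString(raw.trim());
    }

    // Extra exit condition as described in the comment: only leave the shadow
    // round once the data carries a HOST_ID for the current node.
    static boolean canFinishShadowRound(Map<String, EndpointStateSketch> shadowData, String self) {
        EndpointStateSketch own = shadowData.get(self);
        return own != null && own.get(AppState.HOST_ID) != null;
    }

    public static void main(String[] args) {
        String self = "10.0.0.3"; // node C in the example (made-up address)

        // What a freshly bounced B would send: C's address with no states.
        Map<String, EndpointStateSketch> fromB = new HashMap<>();
        fromB.put(self, new EndpointStateSketch());

        System.out.println("can finish: " + canFinishShadowRound(fromB, self));
        try {
            naiveHostId(fromB, self);
        } catch (NullPointerException e) {
            System.out.println("naive lookup hit an NPE");
        }

        // A more complete response that includes C's HOST_ID.
        fromB.get(self).put(AppState.HOST_ID, UUID.randomUUID().toString());
        System.out.println("can finish: " + canFinishShadowRound(fromB, self));
    }
}
```

The point of the guard is that the decision to leave the shadow round and the later {{HOST_ID}} lookup rely on the same condition, so a nearly empty response from a just-bounced seed is ignored instead of being dereferenced.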
||14155||
|[branch|https://github.com/jasobrown/cassandra/tree/14155]|
|[utests & dtests|https://circleci.com/gh/jasobrown/workflows/cassandra/tree/14155]|

For convenience, here's a comparison against trunk (obviously, ignore the circleci yaml): [compare against trunk|https://github.com/apache/cassandra/compare/trunk...jasobrown:14155]

NOTE: this patch is against trunk, but I think we'll also need it for 3.11.

> [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-14155
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14155
> Project: Cassandra
> Issue Type: Bug
> Reporter: Michael Kjellman
> Assignee: Jason Brown
> Priority: Major
>
> Gossiper is somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)
> {code}
> test teardown failure
> Unexpected error found in node logs (see stdout for full details).
> Errors: [ERROR [main] 2018-01-08 21:41:01,832 CassandraDaemon.java:675 - Exception encountered during startup
> java.lang.NullPointerException: null
> at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769) ~[main/:na]
> at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:511) ~[main/:na]
> at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:761) ~[main/:na]
> at org.apache.cassandra.service.StorageService.initServer(StorageService.java:621) ~[main/:na]
> at org.apache.cassandra.service.StorageService.initServer(StorageService.java:568) ~[main/:na]
> at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:360) [main/:na]
> at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:569) [main/:na]
> at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:658) [main/:na], ERROR [main] 2018-01-08 21:41:01,832 CassandraDaemon.java:675 - Exception encountered during startup
> java.lang.NullPointerException: null
> at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769) ~[main/:na]
> at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:511) ~[main/:na]
> at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:761) ~[main/:na]
> at org.apache.cassandra.service.StorageService.initServer(StorageService.java:621) ~[main/:na]
> at org.apache.cassandra.service.StorageService.initServer(StorageService.java:568) ~[main/:na]
> at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:360) [main/:na]
> at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:569) [main/:na]
> at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:658) [main/:na]]
> {code}