From: Alexis Lê-Quôc
Date: Tue, 29 Mar 2011 02:10:32 -0400
To: user@cassandra.apache.org
Subject: Gossip mysteries (0.7.4 on EC2)

Hi,

To make a long story short, I'm trying to understand the logic behind gossip. The following is an excerpt from a log captured today.

2011-03-28T18:37:56.505316+00:00 Node /10.96.81.193 has restarted, now UP again
2011-03-28T18:37:56.505316+00:00 Node /10.96.81.193 state jump to normal
2011-03-28T18:37:56.505468+00:00 Node /10.124.143.36 has restarted, now UP again
2011-03-28T18:37:56.505468+00:00 Node /10.124.143.36 state jump to normal
2011-03-28T18:37:56.559275+00:00 Binding thrift service to /10.208.214.3:9160
2011-03-28T18:37:56.563852+00:00 Using TFastFramedTransport with a max frame size of 15728640 bytes.
2011-03-28T18:37:56.574986+00:00 Listening for thrift clients...
2011-03-28T18:37:56.805426+00:00 Node /10.202.61.193 has restarted, now UP again
2011-03-28T18:37:56.806171+00:00 Nodes /10.202.61.193 and /10.96.81.193 have the same token [XXX]. Ignoring /10.202.61.193

What's surprising is that 10.96.81.193 has been gone for at least a week (and was the subject of CASSANDRA-2371).

> 2011-03-28T18:37:56.505316+00:00 Node /10.96.81.193 has restarted, now UP again
> 2011-03-28T18:37:56.505316+00:00 Node /10.96.81.193 state jump to normal

The code I'm seeing is:

624     private void handleMajorStateChange(InetAddress ep, EndpointState epState)
625     {
626         if (endpointStateMap.get(ep) != null)
627             logger.info("Node {} has restarted, now UP again", ep);

Here endpointStateMap is not up to date. Shouldn't a gossip exchange mark the node as down immediately? Or, if there's a network partition, the node could at least be marked as being in an unknown state. I'm interpreting UP as "available" from Cassandra's point of view.
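To check my reading, here is a rough sketch of the restart detection as I understand it. All the names below are made up for illustration (this is not the actual Gossiper code): the idea is that nodes gossip a per-process generation, a higher generation than the cached one is reported as a restart, and liveness (UP vs DOWN) is decided separately by the failure detector.

import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of generation-based restart detection in gossip.
// Names are invented for illustration; only the general shape follows
// the handleMajorStateChange excerpt above.
public class GossipSketch
{
    // Locally cached view of every endpoint we have ever heard about.
    // Entries are not dropped just because a node stops heartbeating,
    // which is why a long-dead node can still be "seen" here.
    private final Map<InetAddress, Integer> cachedGeneration = new ConcurrentHashMap<>();

    public void applyRemoteState(InetAddress ep, int remoteGeneration)
    {
        Integer localGeneration = cachedGeneration.get(ep);

        if (localGeneration == null)
        {
            // Never heard of this node before: just record it.
            cachedGeneration.put(ep, remoteGeneration);
        }
        else if (remoteGeneration > localGeneration)
        {
            // A higher generation means the process restarted since we last
            // cached its state -- hence "has restarted, now UP again",
            // even if our cached entry is a week old.
            System.out.printf("Node %s has restarted, now UP again%n", ep);
            cachedGeneration.put(ep, remoteGeneration);
        }
        // The UP/DOWN verdict is made separately by a failure detector that
        // watches heartbeat arrivals, so a dead node is only declared DOWN
        // some time later, once gossip rounds stop delivering heartbeats.
    }
}

If that's roughly right, it would explain why a node that has been gone for a week is still announced as restarted and UP: its stale state is still sitting in the cached map (or gets gossiped back by a peer), and the DOWN verdict only arrives later from the failure detector.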
> 2011-03-28T18:37:56.505468+00:00 Node /10.124.143.36 has restarted, now UP again
> 2011-03-28T18:37:56.505468+00:00 Node /10.124.143.36 state jump to normal

That's fine, the node is actually up.

> 2011-03-28T18:37:56.805426+00:00 Node /10.202.61.193 has restarted, now UP again

Fine too, 10.202.61.193 is up (for real).

> 2011-03-28T18:37:56.806171+00:00 Nodes /10.202.61.193 and /10.96.81.193 have the same token [XXX]. Ignoring /10.202.61.193

It's not clear to me on what basis that decision is made. From the code I see:

728         InetAddress currentOwner = tokenMetadata_.getEndpoint(token);
729         if (currentOwner == null)
730         {
...
735         }
736         else if (endpoint.equals(currentOwner))
737         {
...
741         }
742         else if (Gossiper.instance.compareEndpointStartup(endpoint, currentOwner) > 0)
743         {
...
749         }
750         else
751         {
752             logger_.info(String.format("Nodes %s and %s have the same token %s. Ignoring %s",
753                                        endpoint, currentOwner, token, endpoint));
754         }

So the decision falls through to the default (else) branch, which in this particular case does the wrong thing. I've only cursorily looked at the source code, so I don't know how tokenMetadata gets updated.

I've also seen non-converging gossip rings today, where some ghost nodes show up on some nodes but not on others. My recourse so far has been removetoken (force), since the initial removetoken left one node streaming to another while the recipient remained unaware of any such stream, as shown in the following excerpts.

On Node 1 (sender) -- stuck forever in that state:

Mode: Normal
Streaming to: /Node2
   /data/cassandra/data/Intake/Metrics-f-2271-Data.db/(0,601518756),(601518756,700910088)
       progress=127695240/700910088 - 18%

On Node 2 (recipient) at the very same moment:

Mode: Normal
 Not sending any streams.
 Not receiving any streams.

This kind of discrepancy is also hard to understand, given that nodetool ring on Node 1 yields:

...
Node2    Up     Normal   52.04 GB   51.03%   token1

and the same command on Node 2 yields:

...
Node1    Up     Normal   50.89 GB   23.97%   token2

Any light shed on both issues is appreciated.

--
Alexis Lê-Quôc (@datadoghq)
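PS: to make the tie-break question concrete, here is a toy version of what I understand that if/else chain to be doing. Class and field names (and the generation map) are made up for illustration; only the shape of the decision (newer startup wins, otherwise ignore the claimant) mirrors the StorageService excerpt above.

import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;

// Toy version of the token-ownership tie-break quoted above.
// Not the actual Cassandra code; illustrative only.
public class TokenClaimSketch
{
    private final Map<String, InetAddress> tokenToEndpoint = new HashMap<>();
    private final Map<InetAddress, Integer> startupGeneration = new HashMap<>();

    public void handleTokenClaim(InetAddress endpoint, String token)
    {
        InetAddress currentOwner = tokenToEndpoint.get(token);

        if (currentOwner == null || currentOwner.equals(endpoint))
        {
            // No conflict: record (or re-record) the ownership.
            tokenToEndpoint.put(token, endpoint);
        }
        else if (startupGeneration.getOrDefault(endpoint, 0)
                 > startupGeneration.getOrDefault(currentOwner, 0))
        {
            // The claimant started more recently than the cached owner,
            // so it takes over the token.
            tokenToEndpoint.put(token, endpoint);
        }
        else
        {
            // Default branch: keep the cached owner and drop the claim.
            // If the cached owner is a ghost node with a stale generation,
            // the live claimant gets ignored -- which looks like what I saw.
            System.out.printf("Nodes %s and %s have the same token %s. Ignoring %s%n",
                              endpoint, currentOwner, token, endpoint);
        }
    }
}

If that reading is right, the else branch silently favors whatever tokenMetadata already holds, even when the recorded owner has been dead for a week.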