Date: Wed, 23 Oct 2013 17:28:20 -0400
From: Chris Burroughs
To: user@cassandra.apache.org
CC: Philip Persad
Subject: Re: nodetool status reporting dead node as UN

When debugging gossip-related problems (is this node really down, dead, or in
some weird state?) you might have better luck looking at `nodetool
gossipinfo` (sample output below).

The "UN even though everything is bad" thing might be
https://issues.apache.org/jira/browse/CASSANDRA-5913

I'm not sure exactly what happened in your case. I'm also confused about why
an IP changed on restart.
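To make the gossipinfo suggestion concrete: it dumps the raw state each node
has heard about every endpoint over gossip (generation, heartbeat, STATUS,
schema, and so on), which tells you a lot more than the one-letter UN/DN
summary from `nodetool status`. The output looks roughly like this (the
addresses and values here are made up, and the exact fields vary a little
between versions):

$ nodetool -h x.x.x.222 gossipinfo
/x.x.x.220
  generation:1382560000
  heartbeat:21842
  STATUS:NORMAL,-9211685311179174959
  LOAD:1.56701824E8
  SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
  DC:datacenter1
  RACK:rack1
  RELEASE_VERSION:2.0.1
  HOST_ID:0a8a8be1-d4a3-4798-ba53-0d1a5f25e7c1
/x.x.x.221
  generation:1382561200
  heartbeat:20654
  STATUS:NORMAL,3074457345618258602
  ...

A node that is wedged or flapping usually stands out here: its generation or
heartbeat stops advancing, or different nodes report different STATUS values
for the same endpoint.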
On 10/17/2013 06:12 PM, Philip Persad wrote:
> Hello,
>
> I seem to have gotten my cluster into a bit of a strange state. Pardon
> the rather verbose email, but there is a fair amount of background. I'm
> running a 3-node Cassandra 2.0.1 cluster. This particular cluster is
> used only intermittently for dev/testing and does not see particularly
> heavy use; it's mostly a catch-all cluster for environments which don't
> have a dedicated cluster to themselves. I noticed today that one of the
> nodes had died, because nodetool repair was failing due to a down
> replica. I run nodetool status and, sure enough, one of my nodes shows
> up as down.
>
> When I looked on the actual box, the Cassandra process was up and
> running and everything in the logs looked sensible. The most suspicious
> thing I saw was one CMS garbage collection per hour, each taking
> ~250 ms. Nonetheless, the node was not responding, so I restarted it.
> So far so good: everything is starting up, and my ~30 column families
> across ~6 keyspaces are all initializing. The node then handshakes with
> my other two nodes and reports them both as up. Here is where things
> get strange. According to the logs on the other two nodes, the third
> node has come back up and all is well. However, on the third node
> itself, I see a wall of the following in the logs (IP addresses
> masked):
>
> INFO [GossipTasks:1] 2013-10-17 20:22:25,652 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
> INFO [GossipTasks:1] 2013-10-17 20:22:25,653 Gossiper.java (line 806) InetAddress /x.x.x.221 is now DOWN
> INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:25,655 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
> INFO [RequestResponseStage:3] 2013-10-17 20:22:25,658 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
> INFO [GossipTasks:1] 2013-10-17 20:22:26,654 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
> INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:26,657 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
> INFO [RequestResponseStage:4] 2013-10-17 20:22:26,660 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
> INFO [RequestResponseStage:3] 2013-10-17 20:22:26,660 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
> INFO [GossipTasks:1] 2013-10-17 20:22:27,655 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
> INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:27,660 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
> INFO [RequestResponseStage:4] 2013-10-17 20:22:27,662 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
> INFO [RequestResponseStage:3] 2013-10-17 20:22:27,662 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
> INFO [HANDSHAKE-/x.x.x.221] 2013-10-17 20:22:28,254 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.221
> INFO [GossipTasks:1] 2013-10-17 20:22:28,657 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
> INFO [RequestResponseStage:4] 2013-10-17 20:22:28,660 Gossiper.java (line 789) InetAddress /x.x.x.221 is now UP
> INFO [RequestResponseStage:3] 2013-10-17 20:22:28,660 Gossiper.java (line 789) InetAddress /x.x.x.221 is now UP
> INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:28,661 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
> INFO [RequestResponseStage:4] 2013-10-17 20:22:28,663 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
> INFO [GossipTasks:1] 2013-10-17 20:22:29,658 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
> INFO [GossipTasks:1] 2013-10-17 20:22:29,660 Gossiper.java (line 806) InetAddress /x.x.x.221 is now DOWN
>
> Additionally, client requests to the cluster at consistency QUORUM
> start failing (saying 2 responses were required but only 1 replica
> responded). According to nodetool status, all the nodes are up.
>
> This is clearly not good. I take down the problem node. Nodetool
> reports it down, and QUORUM client reads/writes start working again. In
> an attempt to get the cluster back into a good state, I delete all the
> data on the problem node and then bring it back up. The other two nodes
> log a changed host ID for the IP of the node I wiped and then handshake
> with it. The problem node also comes up, but reads/writes start failing
> again with the same error.
>
> I decide to take the problem node down again. However, this time, even
> after the process is dead, nodetool and the other two nodes report that
> my third node is still up, and requests to the cluster continue to
> fail. Running nodetool status against either of the live nodes shows
> that all nodes are up. Running nodetool status against the dead node
> fails (unsurprisingly, since Cassandra is not even running).
>
> With that background out of the way, I have two questions.
>
> 1) What on earth just happened?
>
> 2) How do I fix my cluster?
>
> Thanks!
>
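For what it's worth, the QUORUM errors you quoted are the arithmetic working
as designed once a replica is unreachable. Assuming a replication factor of 3
(not stated, but it matches the "2 responses were required" in the error
text), the quorum size works out to:

  quorum = floor(RF / 2) + 1 = floor(3 / 2) + 1 = 2

So with one of the three replicas down (or flapping, as in your logs), any
request that only gets a response from 1 replica is one short of the 2 it
needs, and the coordinator fails the request even while nodetool status
claims everything is UN.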