Date: Wed, 23 Oct 2013 17:28:20 -0400
From: Chris Burroughs
To: user@cassandra.apache.org
CC: Philip Persad
Subject: Re: nodetool status reporting dead node as UN

When debugging gossip-related problems (is this node really down, dead, or in
some weird state?) you might have better luck looking at `nodetool
gossipinfo` (sample output below).

The "UN even though everything is bad" thing might be
https://issues.apache.org/jira/browse/CASSANDRA-5913

I'm not sure exactly what happened in your case. I'm also confused about why
an IP changed on restart.
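To make the gossipinfo suggestion concrete: it dumps the raw state each node
has heard about every endpoint over gossip (generation, heartbeat, STATUS,
schema, and so on), which tells you a lot more than the one-letter UN/DN
summary from `nodetool status`. The output looks roughly like this (the
addresses and values here are made up, and the exact fields vary a little
between versions):

$ nodetool -h x.x.x.222 gossipinfo
/x.x.x.220
  generation:1382560000
  heartbeat:21842
  STATUS:NORMAL,-9211685311179174959
  LOAD:1.56701824E8
  SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
  DC:datacenter1
  RACK:rack1
  RELEASE_VERSION:2.0.1
  HOST_ID:0a8a8be1-d4a3-4798-ba53-0d1a5f25e7c1
/x.x.x.221
  generation:1382561200
  heartbeat:20654
  STATUS:NORMAL,3074457345618258602
  ...

A node that is wedged or flapping usually stands out here: its generation or
heartbeat stops advancing, or different nodes report different STATUS values
for the same endpoint.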
On 10/17/2013 06:12 PM, Philip Persad wrote:
> Hello,
>
> I seem to have gotten my cluster into a bit of a strange state. Pardon
> the rather verbose email, but there is a fair amount of background. I'm
> running a 3-node Cassandra 2.0.1 cluster. This particular cluster is
> used only intermittently for dev/testing and does not see particularly
> heavy use; it's mostly a catch-all cluster for environments which don't
> have a dedicated cluster to themselves. I noticed today that one of the
> nodes had died, because nodetool repair was failing due to a down
> replica. I run nodetool status and, sure enough, one of my nodes shows
> up as down.
>
> When I looked on the actual box, the Cassandra process was up and
> running and everything in the logs looked sensible. The most suspicious
> thing I saw was one CMS garbage collection per hour, each taking
> ~250 ms. Nonetheless, the node was not responding, so I restarted it.
> So far so good: everything is starting up, and my ~30 column families
> across ~6 keyspaces are all initializing. The node then handshakes with
> my other two nodes and reports them both as up. Here is where things
> get strange. According to the logs on the other two nodes, the third
> node has come back up and all is well. However, on the third node
> itself, I see a wall of the following in the logs (IP addresses
> masked):
>
> INFO [GossipTasks:1] 2013-10-17 20:22:25,652 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
> INFO [GossipTasks:1] 2013-10-17 20:22:25,653 Gossiper.java (line 806) InetAddress /x.x.x.221 is now DOWN
> INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:25,655 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
> INFO [RequestResponseStage:3] 2013-10-17 20:22:25,658 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
> INFO [GossipTasks:1] 2013-10-17 20:22:26,654 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
> INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:26,657 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
> INFO [RequestResponseStage:4] 2013-10-17 20:22:26,660 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
> INFO [RequestResponseStage:3] 2013-10-17 20:22:26,660 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
> INFO [GossipTasks:1] 2013-10-17 20:22:27,655 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
> INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:27,660 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
> INFO [RequestResponseStage:4] 2013-10-17 20:22:27,662 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
> INFO [RequestResponseStage:3] 2013-10-17 20:22:27,662 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
> INFO [HANDSHAKE-/x.x.x.221] 2013-10-17 20:22:28,254 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.221
> INFO [GossipTasks:1] 2013-10-17 20:22:28,657 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
> INFO [RequestResponseStage:4] 2013-10-17 20:22:28,660 Gossiper.java (line 789) InetAddress /x.x.x.221 is now UP
> INFO [RequestResponseStage:3] 2013-10-17 20:22:28,660 Gossiper.java (line 789) InetAddress /x.x.x.221 is now UP
> INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:28,661 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
> INFO [RequestResponseStage:4] 2013-10-17 20:22:28,663 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
> INFO [GossipTasks:1] 2013-10-17 20:22:29,658 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
> INFO [GossipTasks:1] 2013-10-17 20:22:29,660 Gossiper.java (line 806) InetAddress /x.x.x.221 is now DOWN
>
> Additionally, client requests to the cluster at consistency QUORUM
> start failing (saying 2 responses were required but only 1 replica
> responded). According to nodetool status, all the nodes are up.
>
> This is clearly not good. I take down the problem node. Nodetool
> reports it down, and QUORUM client reads/writes start working again. In
> an attempt to get the cluster back into a good state, I delete all the
> data on the problem node and then bring it back up. The other two nodes
> log a changed host ID for the IP of the node I wiped and then handshake
> with it. The problem node also comes up, but reads/writes start failing
> again with the same error.
>
> I decide to take the problem node down again. However, this time, even
> after the process is dead, nodetool and the other two nodes report that
> my third node is still up, and requests to the cluster continue to
> fail. Running nodetool status against either of the live nodes shows
> that all nodes are up. Running nodetool status against the dead node
> fails (unsurprisingly, since Cassandra is not even running).
>
> With that background out of the way, I have two questions.
>
> 1) What on earth just happened?
>
> 2) How do I fix my cluster?
>
> Thanks!
>
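For what it's worth, the QUORUM errors you quoted are the arithmetic working
as designed once a replica is unreachable. Assuming a replication factor of 3
(not stated, but it matches the "2 responses were required" in the error
text), the quorum size works out to:

  quorum = floor(RF / 2) + 1 = floor(3 / 2) + 1 = 2

So with one of the three replicas down (or flapping, as in your logs), any
request that only gets a response from 1 replica is one short of the 2 it
needs, and the coordinator fails the request even while nodetool status
claims everything is UN.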