From: John Pyeatt <john.pyeatt@singlewire.com>
To: user@cassandra.apache.org
Date: Mon, 21 Oct 2013 13:11:32 -0500
Subject: decommission of one EC2 node in cluster causes other nodes to go DOWN/UP and results in "May not be enough replicas..."

We have a 6-node Cassandra 1.2.10 cluster running on AWS with NetworkTopologyStrategy, a replication factor of 3, and the EC2Snitch. Each AWS availability zone has 2 nodes in it.

When we read or write data at consistency level QUORUM while decommissioning a node, we get "May not be enough replicas present to handle consistency level".

This doesn't make sense: we are only taking one node down, and with an RF of 3, even with one node down a QUORUM read/write should still have enough replicas (2) holding the data.

Looking at the Cassandra log on a server that we are not decommissioning, we see this during the decommission of the other node:
 INFO [GossipTasks:1] 2013-10-21 15:18:10,695 Gossiper.java (line 803) InetAddress /10.0.22.142 is now DOWN
 INFO [GossipTasks:1] 2013-10-21 15:18:10,696 Gossiper.java (line 803) InetAddress /10.0.32.159 is now DOWN
 INFO [HANDSHAKE-/10.0.22.142] 2013-10-21 15:18:10,862 OutboundTcpConnection.java (line 399) Handshaking version with /10.0.22.142
 INFO [GossipTasks:1] 2013-10-21 15:18:11,696 Gossiper.java (line 803) InetAddress /10.0.12.178 is now DOWN
 INFO [GossipTasks:1] 2013-10-21 15:18:11,697 Gossiper.java (line 803) InetAddress /10.0.22.106 is now DOWN
 INFO [GossipTasks:1] 2013-10-21 15:18:11,698 Gossiper.java (line 803) InetAddress /10.0.32.248 is now DOWN

Eventually we see a message like this for each of the nodes:

 INFO [GossipStage:3] 2013-10-21 15:18:19,429 Gossiper.java (line 789) InetAddress /10.0.32.248 is now UP

So eventually the remaining nodes in the cluster come back to life. While those nodes are marked DOWN, I can see why we get the "May not be enough replicas..." message: everything is down.

My question is: why does gossip mark nodes DOWN that we aren't decommissioning in the first place?

--
John Pyeatt
Singlewire Software, LLC
www.singlewire.com
608.661.1184
john.pyeatt@singlewire.com
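P.S. To spell out the quorum arithmetic I'm relying on above (a quick sketch, not anything from the Cassandra source):

```python
# QUORUM needs floor(RF/2) + 1 live replicas. With RF = 3 that is 2,
# so losing a single replica should still satisfy the consistency level.
def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

rf = 3
print(quorum(rf))            # 2
print(rf - 1 >= quorum(rf))  # True: one replica down still meets QUORUM
```

That is why a single decommission shouldn't trigger this error on its own.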
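P.P.S. For anyone digging into this, here is a throwaway sketch (assuming only the Gossiper log format shown above) that measures how long gossip kept a node marked DOWN:

```python
import re
from datetime import datetime

# Sample lines in the format from the excerpt above; in practice you'd
# read these from system.log instead of a hard-coded string.
LOG = """\
INFO [GossipTasks:1] 2013-10-21 15:18:11,698 Gossiper.java (line 803) InetAddress /10.0.32.248 is now DOWN
INFO [GossipStage:3] 2013-10-21 15:18:19,429 Gossiper.java (line 789) InetAddress /10.0.32.248 is now UP
"""

# Capture the timestamp, the node address, and the DOWN/UP transition.
PAT = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*InetAddress /(\S+) is now (DOWN|UP)"
)

down_at = {}   # node -> timestamp it was marked DOWN
downtime = {}  # node -> seconds until gossip marked it UP again
for line in LOG.splitlines():
    m = PAT.search(line)
    if not m:
        continue
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
    node, state = m.group(2), m.group(3)
    if state == "DOWN":
        down_at[node] = ts
    elif node in down_at:
        downtime[node] = (ts - down_at.pop(node)).total_seconds()

print(downtime)  # {'10.0.32.248': 7.731}
```

With the timestamps above, the node appeared dead for roughly 8 seconds, which matches the window in which our QUORUM operations were failing.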