Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of janne.jalkanen@ecyrd.com
 designates 87.108.86.67 as permitted sender)
From: Janne Jalkanen <janne.jalkanen@ecyrd.com>
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_678E6E5B-048E-4B00-8C37-1ECCC8827602"
Message-Id: <291531F8-BA29-40D8-AD31-6122BD161E6F@ecyrd.com>
Mime-Version: 1.0 (Mac OS X Mail 6.6 \(1510\))
Subject: Re: Data loss when swapping out cluster
Date: Tue, 26 Nov 2013 16:14:25 +0200
References: 
 <CAAw6nKurpgHOUx39r+17Z2-YvS4aM=oQ57Vbt+DF7uRzx6HyAQ@mail.gmail.com>
To: user@cassandra.apache.org
In-Reply-To: 
 <CAAw6nKurpgHOUx39r+17Z2-YvS4aM=oQ57Vbt+DF7uRzx6HyAQ@mail.gmail.com>


--Apple-Mail=_678E6E5B-048E-4B00-8C37-1ECCC8827602
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=us-ascii


That sounds bad!  Did you run repair at any stage?  Which consistency
level (CL) are you reading with?
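
The reason I ask about CL: with a replication factor of 3, a read at ONE
only has to reach a single replica, so if a new node did not receive all
of its data, such a read can come back empty until repair has run. A rough
sketch of the usual "R + W > RF" overlap rule follows. The helper names
are made up for illustration and are not part of Cassandra or any driver.

```python
# Hypothetical helpers for illustration only; not a Cassandra or driver API.

def replicas_required(cl: str, rf: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    levels = {"ONE": 1, "TWO": 2, "THREE": 3, "QUORUM": rf // 2 + 1, "ALL": rf}
    return levels[cl]


def read_overlaps_write(write_cl: str, read_cl: str, rf: int) -> bool:
    """True when every read must touch at least one replica that took the write."""
    return replicas_required(read_cl, rf) + replicas_required(write_cl, rf) > rf


rf = 3  # replication factor mentioned in the original post
for write_cl, read_cl in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    print(write_cl, read_cl, read_overlaps_write(write_cl, read_cl, rf))
```

If reads and writes both use ONE, the rule does not hold, and a replica
that is missing data can answer a read on its own.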

/Janne

On 25 Nov 2013, at 19:00, Christopher J. Bottaro <cjbottaro@academicworks.com> wrote:

> Hello,
>
> We recently experienced (pretty severe) data loss after moving our
> 4-node Cassandra cluster from one EC2 availability zone to another.
> Our strategy for doing so was as follows:
>
> - One at a time, bring up new nodes in the new availability zone and
>   have them join the cluster.
> - One at a time, decommission the old nodes in the old availability
>   zone and turn them off (stop the Cassandra process).
>
> Everything seemed to work as expected.  As we decommissioned each node,
> we checked the logs for messages indicating "yes, this node is done
> decommissioning" before turning the node off.
>
> Pretty quickly after the old nodes left the cluster, we started
> getting client calls about missing data.
>
> We immediately turned the old nodes back on, and when they rejoined the
> cluster *most* of the reported missing data returned.  For the rest of
> the missing data, we had to spin up a new cluster from EBS snapshots
> and copy it over.
>
> What did we do wrong?
>
> In hindsight, we noticed a few things which may be clues:
>
> - The new nodes had much lower load after joining the cluster than the
>   old ones (3-4 GB as opposed to 10 GB).
> - We have Ec2Snitch turned on, although we're using SimpleStrategy for
>   replication.
> - The new nodes showed even ownership (via nodetool status) after
>   joining the cluster.
>
> Here's more info about our cluster:
>
> - Cassandra 1.2.10
> - Replication factor of 3
> - Vnodes with 256 tokens
> - All tables made via CQL
> - Data dirs on EBS (yes, we are aware of the performance implications)
>
> Thanks for the help.


--Apple-Mail=_678E6E5B-048E-4B00-8C37-1ECCC8827602--