Subject: Re: nodetool repair fails after expansion
From: Dave Cowen <dave@luciddg.com>
To: user@cassandra.apache.org
Date: Fri, 4 Oct 2013 14:29:33 -0700

I should clarify that we are running Cassandra 1.1.12.

Dave

On Fri, Oct 4, 2013 at 2:08 PM, Dave Cowen wrote:

> We're testing expanding a 4-node cluster into an 8-node cluster, and we
> keep running into issues with the repair process near the end.
>
> We're bringing up nodes 1-by-1 into the cluster, retokening nodes for an
> 8-node configuration, running nodetool cleanup on the nodes after each
> retokening, and then increasing the replication factor to 5. This all
> works without issue, and the cluster appears to be healthy in that
> 8-node configuration with a replication factor of 5.
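
For reference, the per-node steps above amount to roughly the following.
This assumes nodetool move for the retokening, RandomPartitioner tokens,
and the 1.1-era cassandra-cli syntax for the RF change; the hostname,
token value, and keyspace name are placeholders, not our exact values:

    # move an existing node to its new position in the 8-node ring
    # (token shown is 1 * 2**127 / 8 for RandomPartitioner)
    nodetool -h node1.example.com move 21267647932558653966460912964485513216

    # after each retokening, drop the data the node no longer owns
    nodetool -h node1.example.com cleanup

    # once all eight nodes are placed, raise the replication factor:
    # cassandra-cli> update keyspace ourkeyspace
    #                  with strategy_options = {replication_factor : 5};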
> However, when we then run nodetool repair on the nodes, it will at some
> point stall, even when being run on one of the new nodes.
>
> It doesn't appear to stall while it's performing a compaction or
> transferring CF data. We've monitored compactionstats and netstats
> closely, and things always stall when a repair command is started, e.g.:
>
> [2013-10-02 23:19:39,254] Starting repair command #9, repairing 5 ranges
> for keyspace ourkeyspace
>
> The last message from AntiEntropyService is usually something to the
> effect of:
>
> <190>Oct 3 00:01:02 myhost.com 1970947950 [AntiEntropySessions:24] INFO
> org.apache.cassandra.service.AntiEntropyService - [repair
> #9b17d310-2bbd-11e3-0000-e06ec6c436ff] session completed successfully
>
> ... and then the next repair never starts. Nothing in the logs looks
> related.
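
While a repair is running, we watch it with roughly the following
(the hostname is a placeholder; tpstats is a suggestion beyond what we
described above, for checking whether the AntiEntropy-related thread
pools have pending work):

    # validation compactions appear here while Merkle trees are built
    nodetool -h node1.example.com compactionstats

    # inter-node streaming of repaired ranges appears here
    nodetool -h node1.example.com netstats

    # active/pending counts per thread pool, e.g. AntiEntropyStage
    nodetool -h node1.example.com tpstats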
> Where this occurs is arbitrary. If I run repair on individual CFs within
> ourkeyspace, some will succeed and some will fail, but if we start over
> and do the 4-node to 8-node expansion again, things will fail at a
> different place.
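
Running repair one column family at a time looks like this (the CF name
is a placeholder):

    # repairing a single CF narrows a stall down to specific data
    nodetool -h node1.example.com repair ourkeyspace some_cf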

> Advice as to what to look at next?
>
> Thanks,
>
> Dave