Subject: Re: nodetool repair fails after expansion
From: Dave Cowen <dave@luciddg.com>
To: user@cassandra.apache.org
Date: Fri, 4 Oct 2013 14:29:33 -0700

I should clarify that we are running Cassandra 1.1.12.

Dave

On Fri, Oct 4, 2013 at 2:08 PM, Dave Cowen wrote:

> We're testing expanding a 4-node cluster into an 8-node cluster, and we
> keep running into issues with the repair process near the end.
>
> We're bringing up nodes 1-by-1 into the cluster, retokening nodes for an
> 8-node configuration, running nodetool cleanup on the nodes after each
> retokening, and then increasing the replication factor to 5. This all
> works without issue, and the cluster appears to be healthy in that
> 8-node configuration with a replication factor of 5.
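
For reference, the per-node steps above amount to roughly the following.
This assumes nodetool move for the retokening, RandomPartitioner tokens,
and the 1.1-era cassandra-cli syntax for the RF change; the hostname,
token value, and keyspace name are placeholders, not our exact values:

    # move an existing node to its new position in the 8-node ring
    # (token shown is 1 * 2**127 / 8 for RandomPartitioner)
    nodetool -h node1.example.com move 21267647932558653966460912964485513216

    # after each retokening, drop the data the node no longer owns
    nodetool -h node1.example.com cleanup

    # once all eight nodes are placed, raise the replication factor:
    # cassandra-cli> update keyspace ourkeyspace
    #                  with strategy_options = {replication_factor : 5};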
> However, when we then run nodetool repair on the nodes, it will at some
> point stall, even when being run on one of the new nodes.
>
> It doesn't appear to stall while it's performing a compaction or
> transferring CF data. We've monitored compactionstats and netstats
> closely, and things always stall when a repair command is started, e.g.:
>
> [2013-10-02 23:19:39,254] Starting repair command #9, repairing 5 ranges
> for keyspace ourkeyspace
>
> The last message from AntiEntropyService is usually something to the
> effect of:
>
> <190>Oct 3 00:01:02 myhost.com 1970947950 [AntiEntropySessions:24] INFO
> org.apache.cassandra.service.AntiEntropyService - [repair
> #9b17d310-2bbd-11e3-0000-e06ec6c436ff] session completed successfully
>
> ... and then the next repair never starts. Nothing in the logs looks
> related.
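
While a repair is running, we watch it with roughly the following
(the hostname is a placeholder; tpstats is a suggestion beyond what we
described above, for checking whether the AntiEntropy-related thread
pools have pending work):

    # validation compactions appear here while Merkle trees are built
    nodetool -h node1.example.com compactionstats

    # inter-node streaming of repaired ranges appears here
    nodetool -h node1.example.com netstats

    # active/pending counts per thread pool, e.g. AntiEntropyStage
    nodetool -h node1.example.com tpstats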
> Where this occurs is arbitrary. If I run repair on individual CFs within
> ourkeyspace, some will succeed and some will fail, but if we start over
> and do the 4-node to 8-node expansion again, things will fail at a
> different place.
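
Running repair one column family at a time looks like this (the CF name
is a placeholder):

    # repairing a single CF narrows a stall down to specific data
    nodetool -h node1.example.com repair ourkeyspace some_cf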

> Advice as to what to look at next?
>
> Thanks,
>
> Dave