From: Aaron Morton
To: Cassandra User
Subject: Re: Repair hangs - Cassandra 1.2.10
Date: Tue, 10 Dec 2013 20:20:30 +1300

> I changed logging to debug level, but still nothing is logged.
> Again - any help will be appreciated.

Is there nothing at the ERROR level on any machine?

Check nodetool compactionstats to see whether a validation compaction is running; the repair may be waiting on it.

Check nodetool netstats to see whether streams are being exchanged, then check the logs on the machines involved in those streams.
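
As a rough sketch of the two checks above, assuming nodetool is on the PATH and each node's JMX port is reachable from the machine you run this on. The helper name check_repair_activity and the node names in the usage line are illustrative placeholders, not anything from your cluster:

```shell
#!/bin/sh
# Sketch only: run the two checks against every node passed as an argument.
# check_repair_activity is a hypothetical helper name used for illustration.
check_repair_activity() {
    for host in "$@"; do
        echo "== $host: compactionstats =="
        # Look for a validation compaction here; the repair may be waiting on it.
        nodetool -h "$host" compactionstats

        echo "== $host: netstats =="
        # Look for streams being exchanged with other nodes.
        nodetool -h "$host" netstats
    done
}
```

Call it with your own addresses, for example check_repair_activity node1 node2 node3 node4. If a node shows active streams, its logs are the next place to look.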

cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 4/12/2013, at 10:24 pm, Tamar Rosen <tamar@correlor.com> wrote:

Update - I am still experiencing the above issues, but not all the time. I was able to run repair (on this keyspace) from node 2 and from node 4, but now a different keyspace hangs on these nodes, and I am still not able to run repair on node 1. It seems random. I changed logging to debug level, but still nothing is logged.
Again - any help will be appreciated.

Tamar


On Mon, Dec 2, 2013 at 11:53 AM, Tamar Rosen <tamar@correlor.com> wrote:
Hi,

On AWS, we had a 2-node cluster with RF 2.
We added 2 more nodes, then changed RF to 3 on all our keyspaces.
The next step was to run nodetool repair, node by node.
(In the meantime, we found that we must use CL QUORUM, which is affecting our application's performance.)
We started with node 1, which is one of the old nodes.
We ran:
nodetool repair -pr

It seemed to be progressing fine, running keyspace by keyspace, for about an hour, but then it hung. The last messages in the output are:
 
[2013-12-01 11:18:24,577] Repair command #4 finished
[2013-12-01 11:18:24,594] Starting repair command #5, repairing 230 ranges for keyspace correlor_customer_766


It stayed like this for almost 24 hours. Then we read that this might be related to not having upgraded the sstables, so we killed the process. We were not sure whether we had run upgradesstables (we upgraded from 1.2.4 a couple of months ago).

So:
We ran upgradesstables on a specific table in the keyspace that repair got stuck on (this was fast):
nodetool upgradesstables correlor_customer_766 users
Then we ran repair on that same table:
nodetool repair correlor_customer_766 users -pr

This is hanging again.
The first and only output from this process is:
[2013-12-02 08:22:41,221] Starting repair command #6, repairing 230 ranges for keyspace correlor_customer_766

Nothing else happened for more than an hour.

Any help and advice will be greatly appreciated.

Tamar Rosen
correlor.com



 



