From: Bart Swedrowski
Date: Fri, 13 Jul 2012 11:16:49 +0100
Subject: Never ending manual repair after adding second DC
To: user@cassandra.apache.org

Hello everyone,

I'm facing a rather weird problem with Cassandra since we added a secondary DC to our cluster, and we have totally run out of ideas; this email is a call for help/advice!

The history looks like this:
- we used to have 4 nodes in a single DC
- running Cassandra 0.8.7
- RF:3
- around 50GB of data on each node
- RandomPartitioner and SimpleSnitch

All was working fine for over 9 months. A few weeks ago we decided we wanted to add another 4 nodes in a second DC and join them to the cluster. Prior to doing that, we upgraded Cassandra to 1.0.9, so as to get that out of the door before the multi-DC work. After the upgrade we left it running for over a week and it was all good; no issues.
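For reference, the keyspace at that point was a plain RF:3 setup, so something along these lines (keyspace name is made up, and the exact cassandra-cli syntax differs slightly between 0.8 and 1.0):

  -- illustrative only: "OurKeyspace" is a made-up name
  create keyspace OurKeyspace
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor : 3};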
Then we added 4 additional nodes in another DC, bringing the cluster to 8 nodes in total spread across two DCs, so now we have:
- 8 nodes across 2 DCs, 4 in each DC
- a 100Mbps, low-latency (sub-5ms) link running over a Cisco ASA site-to-site VPN (which is IKEv1 based)
- RF of DC1:3,DC2:3
- RandomPartitioner, and now PropertyFileSnitch

nodetool ring now looks as follows:

$ nodetool -h localhost ring
Address         DC    Rack  Status  State   Load      Owns    Token
                                                              148873535527910577765226390751398592512
192.168.81.2    DC1   RC1   Up      Normal  37.9 GB   12.50%  0
192.168.81.3    DC1   RC1   Up      Normal  35.32 GB  12.50%  21267647932558653966460912964485513216
192.168.81.4    DC1   RC1   Up      Normal  39.51 GB  12.50%  42535295865117307932921825928971026432
192.168.81.5    DC1   RC1   Up      Normal  19.42 GB  12.50%  63802943797675961899382738893456539648
192.168.94.178  DC2   RC1   Up      Normal  40.72 GB  12.50%  85070591730234615865843651857942052864
192.168.94.179  DC2   RC1   Up      Normal  30.42 GB  12.50%  106338239662793269832304564822427566080
192.168.94.180  DC2   RC1   Up      Normal  30.94 GB  12.50%  127605887595351923798765477786913079296
192.168.94.181  DC2   RC1   Up      Normal  12.75 GB  12.50%  148873535527910577765226390751398592512

(Please ignore the fact that the nodes are not interleaved; they should be, but there was a hiccup during the implementation phase. Unless *this* is the problem!)

Now, the problem: more than 7 out of 10 manual repairs never finish. They usually get stuck and show 3 different symptoms:

1) Say node 192.168.81.2 runs a manual repair. It requests merkle trees from 192.168.81.2, 192.168.81.3, 192.168.81.5, 192.168.94.178, 192.168.94.179 and 192.168.94.181. It receives them from 192.168.81.2, 192.168.81.3, 192.168.81.5, 192.168.94.178 and 192.168.94.179, but not from 192.168.94.181. The logs on 192.168.94.181 say it has sent its merkle tree back, but the tree is never received by 192.168.81.2.

2) Same scenario: 192.168.81.2 requests merkle trees from the same six nodes and again receives all of them except 192.168.94.181's, but this time the logs on 192.168.94.181 say *nothing* about a merkle tree being sent, and compactionstats doesn't show one being validated (generated) either.

3) The merkle trees are delivered and the nodes start streaming data to each other to sync themselves, but on certain occasions they get "stuck" streaming files between each other at 100% and never move forward.

Now, the interesting bit: the nodes that get stuck are always in different DCs!

Pretty much every single scenario points towards a connectivity problem. However, we also have a few PostgreSQL replication streams running over this connection, some other traffic, and quite a lot of monitoring, and none of those are affected in any way. Also, if random packets were being lost, I'd expect TCP to correct that (re-transmit them).

It doesn't matter whether it's a full manual repair or just a -pr repair; both end up pretty much the same way.

Has anyone come across this kind of issue before, or does anyone have ideas how else I could investigate it? The issue is pressing me massively, as this is our live cluster and I have to run repairs pretty much by hand (usually multiple times before one finally goes through) every single day… And I'm also not sure whether the cluster is being affected in any other way, BTW.
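In case it's useful, this is roughly how I've been watching the merkle tree exchange and the streams on each node while a repair runs (the log path is just our packaged default, /var/log/cassandra/system.log; adjust for your layout):

  # merkle tree requests/replies as logged by the repair (anti-entropy) service
  $ grep -i merkle /var/log/cassandra/system.log | tail -n 20

  # validation compactions, i.e. merkle tree generation in progress
  $ nodetool -h localhost compactionstats

  # streams that sit at 100% and never complete
  $ nodetool -h localhost netstats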
I've gone through the Jira issues and considered upgrading to 1.1.x, but I can't see anything there that even looks like what is happening to my cluster.

If any further information, such as logs or configuration files, is needed, please let me know. Any information, suggestions or advice greatly appreciated.

Kind regards,
Bart
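P.S. In case it helps, our cassandra-topology.properties simply maps the IPs above onto the two DCs, along these lines (the default= entry below is illustrative):

  192.168.81.2=DC1:RC1
  192.168.81.3=DC1:RC1
  192.168.81.4=DC1:RC1
  192.168.81.5=DC1:RC1
  192.168.94.178=DC2:RC1
  192.168.94.179=DC2:RC1
  192.168.94.180=DC2:RC1
  192.168.94.181=DC2:RC1
  default=DC1:RC1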