From: Bill Au
To: user@cassandra.apache.org
Date: Mon, 16 Jul 2012 12:32:20 -0400
Subject: Re: Never ending manual repair after adding second DC

I ran into the same problem before:

http://comments.gmane.org/gmane.comp.db.cassandra.user/25334

I have not found a solution yet.

Bill

On Mon, Jul 16, 2012 at 11:10 AM, Bart Swedrowski wrote:

> On 16 July 2012 11:25, aaron morton wrote:
>
>> In the before time someone had problems with a switch/router that was
>> dropping persistent but idle connections. Doubt this applies, and it would
>> probably result in an error, just throwing it out there.
>
> Yes, been through them a few times. There are literally no errors or
> warnings at all. And sometimes, as aforementioned, there's actually an INFO
> message saying the merkle tree has been sent, while the other side is not
> receiving it.
>
> Just now, I kicked off a manual repair on the node with IP 192.168.94.178
> and it just got stuck on streaming files again.
>
> Node 192.168.94.179:
>
>> Streaming from: /192.168.81.5
>>    Medals: /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46 progress=0/5096 - 0%
>>    Medals: /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
>>    Medals: /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228 progress=0/82859 - 0%
>
> Node 192.168.81.5:
>
>> Streaming to: /192.168.94.179
>>    /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2 progress=168/168 - 100%
>>    /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
>>    /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46 progress=0/5096 - 0%
>>    /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228 progress=0/82859 - 0%
>
> It looks like streaming of this specific SSTable hasn't finished (or hasn't
> been ACKed on the other side):
>
>>    /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2 progress=168/168 - 100%
>
> This morning I tightened monitoring, so now each node monitors every other
> node with ICMP packets (20 every minute). Monitoring is silent: no issues
> reported since the morning, not a single packet lost.
>
> I got some help from the Acunu guys. At first we believed we had fixed the
> problem by disabling bonding on the servers, and blamed it for messing up
> interrupts; however, this morning the problem resurfaced.
>
> I can see (and Acunu says) that everything points to a network-related
> problem (although I'd expect the IP stack to correct simple packet loss),
> but there's no way to back this up (unless only Cassandra-related traffic is
> getting lost, but *how* to monitor for it???).
>
> Honestly, running out of ideas - further advice highly appreciated.
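On the question of how to monitor Cassandra's own traffic rather than ICMP: one rough option is to poll the streaming status on each node and flag files whose progress does not move between polls. The sketch below only parses text in the format quoted in this thread (which looks like `nodetool netstats` output; that is an assumption). The names `parse_streams` and `stalled` are hypothetical helpers, not part of Cassandra.

```python
import re
from typing import Dict, List, Tuple

# Key: (direction, peer address, SSTable path); value: (bytes done, bytes total).
StreamKey = Tuple[str, str, str]
Snapshot = Dict[StreamKey, Tuple[int, int]]

# Either a "Streaming from/to: /<peer>" header or a per-file progress entry.
# \s+ lets an entry match even if the line was wrapped by a mail client.
TOKEN = re.compile(
    r"Streaming (?P<dir>from|to): /(?P<peer>\S+)"
    r"|(?P<path>/\S+-Data\.db)\s+sections=\d+\s+"
    r"progress=(?P<done>\d+)/(?P<total>\d+)"
)


def parse_streams(text: str) -> Snapshot:
    """Parse streaming status text into a snapshot of per-file progress."""
    snapshot: Snapshot = {}
    direction, peer = "", ""
    for match in TOKEN.finditer(text):
        if match.group("dir"):
            # A header sets the context for the file entries that follow it.
            direction, peer = match.group("dir"), match.group("peer")
        else:
            key = (direction, peer, match.group("path"))
            snapshot[key] = (int(match.group("done")), int(match.group("total")))
    return snapshot


def stalled(previous: Snapshot, current: Snapshot) -> List[StreamKey]:
    """Return unfinished streams whose byte count did not change between polls."""
    result = []
    for key, (done, total) in current.items():
        if done < total and key in previous and previous[key][0] == done:
            result.append(key)
    return sorted(result)
```

In use, one would capture the status text on a schedule (for example once a minute), keep the previous snapshot, and alert when `stalled(previous, current)` stays non-empty for several polls in a row. A file that sits at 100% without disappearing, like `dataa-hd-1129-Data.db` above, is not caught by this check and would need a separate rule.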