incubator-cassandra-user mailing list archives

From Bart Swedrowski <b...@timedout.org>
Subject Re: Never ending manual repair after adding second DC
Date Mon, 16 Jul 2012 15:10:43 GMT
On 16 July 2012 11:25, aaron morton <aaron@thelastpickle.com> wrote:

> In the before time someone had problems with a switch/router that was
> dropping persistent but idle connections. Doubt this applies, and it would
> probably result in an error, just throwing it out there.
>

Yes, I've been through them a few times. There are literally no errors or
warnings at all. And sometimes, as mentioned earlier, there's actually an INFO
message saying the Merkle tree has been sent, while the other side never
receives it.

Just now, I kicked off a manual repair on the node with IP 192.168.94.178 and
it got stuck on streaming files again.

Node 192.168.94.179:

Streaming from: /192.168.81.5
   Medals: /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46 progress=0/5096 - 0%
   Medals: /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
   Medals: /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228 progress=0/82859 - 0%


Node 192.168.81.5:

Streaming to: /192.168.94.179
   /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2 progress=168/168 - 100%
   /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
   /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46 progress=0/5096 - 0%
   /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228 progress=0/82859 - 0%


It looks like streaming of this specific SSTable hasn't finished (or hasn't
been ACKed by the other side):

   /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2 progress=168/168 - 100%
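
The comparison between the two nodes' streaming listings can be automated. What follows is only a rough sketch, assuming the listings are captured as plain text in the format shown in this message; the function names `parse_progress` and `sent_but_not_received` are made up for illustration and are not part of Cassandra or nodetool. It flags SSTables that the sender reports as fully sent but that the receiver does not list at all.

```python
import re

# Matches entries such as:
#   /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2 progress=168/168 - 100%
ENTRY = re.compile(
    r"(?P<path>/\S+-Data\.db)\s+sections=(?P<sections>\d+)\s+"
    r"progress=(?P<done>\d+)/(?P<total>\d+)"
)

def parse_progress(text):
    """Return {sstable path: (bytes done, bytes total)} from a streaming listing."""
    # Undo mail quoting and line wrapping: strip leading '>' markers and
    # spaces, then join everything so each entry sits on one logical line.
    flat = " ".join(line.lstrip("> ").strip() for line in text.splitlines())
    return {
        m["path"]: (int(m["done"]), int(m["total"]))
        for m in ENTRY.finditer(flat)
    }

def sent_but_not_received(sender_text, receiver_text):
    """SSTables the sender shows at 100% that the receiver does not list."""
    sent = parse_progress(sender_text)
    received = parse_progress(receiver_text)
    return sorted(
        path
        for path, (done, total) in sent.items()
        if done == total and path not in received
    )
```

Fed the two listings above, this would single out the `dataa-hd-1129-Data.db` file. Whether such a file is really stuck or merely waiting for an acknowledgement is not something the listing alone can tell.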


This morning I tightened monitoring, so each node now monitors every other
node with ICMP packets (20 every minute), and the monitoring is silent: no
issues reported since the morning, not a single packet lost.

I got some help from the Acunu guys. At first we believed we had fixed the
problem by disabling bonding on the servers, which we blamed for messing
things up with interrupts; however, this morning the problem resurfaced.

I can see (and Acunu says) that everything points to a network-related
problem (although I'd expect the IP stack to correct simple packet loss), but
there's no way to back this up (unless only Cassandra-related traffic is
getting lost, but *how* to monitor for that???).
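
One possible way to watch something closer to Cassandra's own traffic than ICMP is to probe the TCP port the nodes use to talk to each other. The sketch below is an assumption-laden illustration, not a tested monitoring setup: `probe_tcp` is a hypothetical helper, and port 7000 in the usage note is only Cassandra's default `storage_port`, which may differ in a given cluster. It also only exercises new connections, so it would not catch long-lived idle connections being dropped along the way.

```python
import socket
import time

def probe_tcp(host, port, attempts=20, timeout=2.0, pause=0.0):
    """Attempt `attempts` TCP connects to host:port.

    Returns a (successes, failures) tuple. A failure is any connect that
    is refused, unreachable, or does not complete within `timeout` seconds.
    """
    ok = failed = 0
    for _ in range(attempts):
        try:
            # Open and immediately close a connection; no data is sent.
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError:
            failed += 1
        time.sleep(pause)
    return ok, failed
```

For example, `probe_tcp("192.168.81.5", 7000, attempts=20, pause=3.0)` run from 192.168.94.179 would roughly mirror the 20-per-minute ICMP check, but over TCP.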

Honestly, I'm running out of ideas; any further advice would be highly
appreciated.
