incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Never ending manual repair after adding second DC
Date Mon, 16 Jul 2012 18:18:21 GMT
Even if it is a network error it would be good to detect it. 

If you can run a small repair with those log settings I can take a look at the logs if
you want. Cannot promise anything, but another set of eyes may help. 

Ping me off list if you want to send me the logs. 

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/07/2012, at 4:32 AM, Bill Au wrote:

> I ran into the same problem before:
> 
> http://comments.gmane.org/gmane.comp.db.cassandra.user/25334
> 
> I have not found any solutions yet.
> 
> Bill
> 
> On Mon, Jul 16, 2012 at 11:10 AM, Bart Swedrowski <bart@timedout.org> wrote:
> 
> 
> On 16 July 2012 11:25, aaron morton <aaron@thelastpickle.com> wrote:
> In the before time someone had problems with a switch/router that was dropping persistent
> but idle connections. Doubt this applies, and it would probably result in an error, just throwing
> it out there.
> 
> Yes, been through them a few times.  There are literally no errors or warnings at all.  And
> sometimes, as aforementioned, there's actually an INFO message saying the merkle tree has been
> sent, while the other side never receives it.
> 
> Just now, I kicked off a manual repair on the node with IP 192.168.94.178 and it just got stuck
> on streaming files again.
> 
> Node 192.168.94.179:
> 
> Streaming from: /192.168.81.5
>    Medals: /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46 progress=0/5096 - 0%
>    Medals: /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
>    Medals: /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228 progress=0/82859 - 0%
> 
> Node 192.168.81.5:
> 
> Streaming to: /192.168.94.179
>    /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2 progress=168/168 - 100%
>    /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
>    /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46 progress=0/5096 - 0%
>    /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228 progress=0/82859 - 0%
> 
> Looks like streaming this specific SSTable hasn't finished (or been ACKed on the other
> side):
> 
>    /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2 progress=168/168 - 100%
> 
> This morning I tightened monitoring, so now every node monitors every other node with
> ICMP packets (20 every minute), and monitoring is silent; no issues reported since the morning,
> not a single packet lost.
> 
> I got some help from the Acunu guys. At first we believed we had fixed the problem by disabling
> bonding on the servers, and blamed bonding for messing up interrupts; however, this morning
> the problem resurfaced.
> 
> I can see (and Acunu says) everything is pointing to a network-related problem (although
> I'd expect the IP stack to correct simple packet loss), but there's no way to back this up (unless
> only Cassandra-related traffic is getting lost, but *how* to monitor for it???).
> 
> Honestly, running out of ideas - further advice highly appreciated.
> 
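
One possible way to watch Cassandra-specific traffic rather than ICMP is to probe the inter-node TCP port directly. The sketch below is only an illustration and not something from this thread: it assumes the storage port is Cassandra's default of 7000 (check `storage_port` in cassandra.yaml), and the helper names `tcp_probe` and `probe_all` are made up for the example. It only tests whether a new TCP connection can be opened, so it would not by itself catch a device that silently drops long-lived idle connections.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Try to open a TCP connection to host:port.

    Returns the connect latency in seconds, or None if the
    connection was refused, timed out, or otherwise failed.
    """
    start = time.monotonic()
    try:
        # create_connection performs the full TCP handshake, so a
        # success means the port is reachable at the TCP level.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def probe_all(peers, port=7000, timeout=2.0):
    """Probe every peer once; port 7000 is an assumed default."""
    return {host: tcp_probe(host, port, timeout) for host in peers}


# Example use from one node, with the addresses mentioned above:
#   results = probe_all(["192.168.94.178", "192.168.94.179", "192.168.81.5"])
#   for host, latency in results.items():
#       print(host, "FAILED" if latency is None else "%.1f ms" % (latency * 1000))
```

Run from cron or a loop on each node, this would at least show whether TCP connects to the storage port fail at times when ICMP monitoring stays clean.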

