On 16 July 2012 11:25, aaron morton <aaron@thelastpickle.com> wrote:
In the before time someone had problems with a switch/router that was dropping persistent but idle connections. Doubt this applies, and it would probably result in an error, just throwing it out there.

Yes, I've been through them a few times. There are literally no errors or warnings at all. And sometimes, as mentioned before, there's actually an INFO message saying the Merkle tree has been sent while the other side never receives it.

Just now I kicked off a manual repair on the node with IP 192.168.94.178, and it got stuck streaming files again.

Node 192.168.94.179:

Streaming from: /192.168.81.5
   Medals: /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46 progress=0/5096 - 0%
   Medals: /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
   Medals: /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228 progress=0/82859 - 0%

Node 192.168.81.5:

Streaming to: /192.168.94.179
   /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2 progress=168/168 - 100%
   /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
   /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46 progress=0/5096 - 0%
   /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228 progress=0/82859 - 0%

It looks like streaming of this specific SSTable hasn't finished (or hasn't been ACKed by the other side):

   /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2 progress=168/168 - 100%
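A stuck stream like the one above could be spotted mechanically by taking two snapshots of the streaming output a few minutes apart and comparing them. What follows is only a minimal sketch, assuming the output keeps the `sections=N progress=done/total` layout pasted above; the function names `parse_netstats`, `stalled_files` and `complete_but_listed` are made up for illustration and are not part of Cassandra.

```python
import re

# Matches file lines in the layout pasted above, for example:
#   Medals: /var/lib/.../dataa-hd-1127-Data.db sections=46 progress=0/5096 - 0%
# Assumption: the output format stays exactly like this.
LINE_RE = re.compile(
    r"(?P<path>/\S+)\s+sections=(?P<sections>\d+)\s+"
    r"progress=(?P<done>\d+)/(?P<total>\d+)"
)


def parse_netstats(text):
    """Return {path: (done, total)} for every file line found in text."""
    progress = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            progress[match.group("path")] = (
                int(match.group("done")),
                int(match.group("total")),
            )
    return progress


def stalled_files(before, after):
    """Incomplete files whose progress did not advance between two snapshots."""
    return sorted(
        path
        for path, (done, total) in after.items()
        if path in before and before[path][0] == done and done < total
    )


def complete_but_listed(snapshot):
    """Files at 100% that are still listed, i.e. possibly never ACKed."""
    return sorted(
        path for path, (done, total) in snapshot.items() if done == total
    )
```

Feeding it two captures from the sending node would list the files stuck at 0% separately from the one sitting at 100%, which is the case that looks like a missing ACK.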

This morning I tightened monitoring, so each node now monitors every other node with ICMP packets (20 every minute). Monitoring has been silent: no issues reported since the morning, not a single packet lost.

I got some help from the Acunu guys. At first we believed we had fixed the problem by disabling bonding on the servers, which we blamed for messing things up with interrupts; however, this morning the problem resurfaced.

I can see (and Acunu says) that everything points to a network-related problem (although I'd expect the IP stack to recover from simple packet loss), but there's no way to back this up (unless only Cassandra-related traffic is getting lost, but *how* to monitor for it???).
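One way to go beyond ICMP would be to probe the TCP port that the nodes use to talk to each other. The following is a rough sketch, not a proven diagnostic: it assumes the inter-node port is 7000 (Cassandra's default `storage_port`; check cassandra.yaml), and the helper names `tcp_probe` and `probe_peers` are hypothetical. It only verifies that a new TCP connection can be opened, so it would not catch a device that silently drops established but idle connections.

```python
import socket
import time

# Assumption: default Cassandra storage_port; adjust to match cassandra.yaml.
STORAGE_PORT = 7000


def tcp_probe(host, port=STORAGE_PORT, timeout=5.0):
    """Open a TCP connection to host:port.

    Returns the connect time in seconds, or None if the connection
    was refused, unreachable, or timed out.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refusals, unreachable hosts and timeouts alike.
        return None


def probe_peers(hosts, port=STORAGE_PORT, timeout=5.0):
    """Probe every host and return {host: connect_time_or_None}."""
    return {host: tcp_probe(host, port, timeout) for host in hosts}
```

Running something like `probe_peers(["192.168.94.179", "192.168.81.5"])` from each node at the same interval as the ICMP checks would show whether TCP to the storage port fails while ping stays clean.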

Honestly, running out of ideas - further advice highly appreciated.