Check the logs for messages about nodes going up and down, and also look at the MessagingService MBean for timeouts. If the node in DC2 times out replying to the node in DC1, the DC1 node will store a hint.
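For example, something along these lines (a rough sketch only; the log path, JMX port and jmxterm jar name are assumptions, and the MBean attribute names can differ between versions):

    # gossip messages about remote nodes flapping
    grep -iE "is now (dead|down|up)" /var/log/cassandra/system.log

    # per-host timeouts tracked by the MessagingService MBean (default JMX port 7199)
    echo "get -b org.apache.cassandra.net:type=MessagingService RecentTimeoutsPerHost" | \
      java -jar jmxterm.jar -l localhost:7199 -n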

Also, when hints are stored they are TTL'd to the gc_grace_seconds of the CF (IIRC). If that's low, the hints may have expired before they could be delivered.
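As a quick sanity check, something like this might do (a sketch; MyCF is a placeholder for your column family name, and the describe output wording varies by version):

    # inside cassandra-cli (cassandra-cli -h localhost):
    use MyKeySpace;
    describe MyCF;      -- look for the "GC grace seconds" value

    # hints still queued for delivery show up under the HintedHandoff stage
    nodetool -h localhost tpstats | grep -i hinted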

I'm not aware of any specific tracking for failed hints other than log messages.
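Grepping for the hint-related log lines is probably the closest thing (the exact message wording varies between versions, so treat the patterns below as a guess):

    grep -iE "hinted handoff|replaying hints|timed out replaying" /var/log/cassandra/system.log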


Aaron Morton
New Zealand

Co-Founder & Principal Consultant
Apache Cassandra Consulting

On 28/09/2013, at 12:01 AM, Oleg Dulin <> wrote:

Here is some more information.

I am running full repair on one of the nodes and I am observing strange behavior.

Both DCs were up during the data load, but repair is reporting a lot of out-of-sync data. Why would that be? Is there a way for me to tell whether the WAN may be dropping hinted handoff traffic?
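(One rough check, assuming nodetool access on the DC2 nodes: the dropped-message counters at the end of tpstats show whether mutations or hint replays are being shed locally; counter names differ slightly between versions.)

    nodetool -h localhost tpstats | grep -iA10 dropped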


On 2013-09-27 10:35:34 +0000, Oleg Dulin said:

Wanted to add one more thing:
I can also tell that the numbers are not consistent across DCs this way -- I have a column family with really wide rows (a couple of million columns).
DC1 reports higher column counts than DC2. DC2 only becomes consistent after I run the count a couple of times and trigger a read repair. But why would the nodetool repair logs show that everything is in sync?
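(For reference, the kind of check described above might look like this in cassandra-cli; MyCF, the row key and the consistency level are placeholders -- run it against a node in each DC and compare the counts.)

    # inside cassandra-cli (cassandra-cli -h <node-in-each-DC>):
    use MyKeySpace;
    consistencylevel as LOCAL_QUORUM;
    count MyCF['some_wide_row_key'];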
On 2013-09-27 10:23:45 +0000, Oleg Dulin said:
Consider this output from nodetool ring:
Address   DC    Rack   Status   State    Load       Effective-Ownership   Token
dc1.5     DC1   RAC1   Up       Normal   32.07 GB   50.00%                0
dc2.100   DC2   RAC1   Up       Normal   8.21 GB    50.00%                100
dc1.6     DC1   RAC1   Up       Normal   32.82 GB   50.00%                42535295865117307932921825928971026432
dc2.101   DC2   RAC1   Up       Normal   12.41 GB   50.00%                42535295865117307932921825928971026532
dc1.7     DC1   RAC1   Up       Normal   28.37 GB   50.00%                85070591730234615865843651857942052864
dc2.102   DC2   RAC1   Up       Normal   12.27 GB   50.00%                85070591730234615865843651857942052964
dc1.8     DC1   RAC1   Up       Normal   27.34 GB   50.00%                127605887595351923798765477786913079296
dc2.103   DC2   RAC1   Up       Normal   13.46 GB   50.00%                127605887595351923798765477786913079396
I concealed IPs and DC names for confidentiality.
All of the data loading was happening against DC1 at a pretty brisk rate of, say, 200K writes per minute.
Note how my tokens are offset by 100. Shouldn't that mean the load on each node should be roughly identical? In DC1 it is roughly 30 GB on each node. In DC2 it is almost a third of that of the nearest DC1 node by token range.
To verify that the nodes are in sync, I ran nodetool -h localhost repair MyKeySpace --partitioner-range on each node in DC2. Watching the logs, I see that the repair went really quickly and reported all column families in sync!
I need help making sense of this. Is this because DC1 is not fully compacted? Is it because DC2 is not fully synced and I am not checking correctly? How can I tell whether replication is still in progress (note: I started my load yesterday at 9:50am)?
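(Two things that might help answer the last question, assuming nodetool/JMX access; the MBean name and the jmxterm jar path are assumptions and may differ by version.)

    # streaming sessions (repair, bootstrap) currently in flight
    nodetool -h localhost netstats

    # endpoints that still have hints queued for them
    echo "run -b org.apache.cassandra.db:type=HintedHandoffManager listEndpointsPendingHints" | \
      java -jar jmxterm.jar -l localhost:7199 -n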

Oleg Dulin