Oh yes, I forgot about the thread. I assume you are talking about: http://grokbase.com/t/cassandra/user/12ab6pbs5n/unnecessary-tombstones-transmission-during-repair-process
I think these are multiple issues that correlate with each other:
1) Repair uses the local timestamp of DeletedColumns for Merkle tree calculation. This is what the other thread was about.
Alexey claims that this was fixed by some other commit: https://issues.apache.org/jira/secure/attachment/12544204/CASSANDRA-4561-CS.patch
But honestly, I don't see how this solves it. I do understand how Alexey's patch from a few messages earlier would solve it (by overriding the updateDigest method in DeletedColumn).
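To illustrate point 1, here is a minimal sketch of the idea behind overriding updateDigest for tombstones. It is not the actual patch and not the real DeletedColumn class: Tombstone, digestOf and the includeLocalDeletionTime flag are hypothetical names, and the sketch assumes that the local deletion time can differ between replicas while the client timestamp is the same everywhere.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class TombstoneDigestSketch {
    /** Simplified stand-in for a tombstone; NOT the real Cassandra DeletedColumn. */
    static final class Tombstone {
        final String name;
        final long timestamp;        // client-supplied timestamp, identical on all replicas
        final int localDeletionTime; // seconds from a node's local clock (assumed to differ per replica)

        Tombstone(String name, long timestamp, int localDeletionTime) {
            this.name = name;
            this.timestamp = timestamp;
            this.localDeletionTime = localDeletionTime;
        }

        /** Feeds the tombstone into the digest, optionally including the node-local field. */
        void updateDigest(MessageDigest digest, boolean includeLocalDeletionTime) {
            digest.update(name.getBytes(StandardCharsets.UTF_8));
            digest.update(ByteBuffer.allocate(8).putLong(timestamp).array());
            if (includeLocalDeletionTime) {
                digest.update(ByteBuffer.allocate(4).putInt(localDeletionTime).array());
            }
        }
    }

    /** Computes an MD5 digest over a single tombstone. */
    static byte[] digestOf(Tombstone t, boolean includeLocalDeletionTime) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            t.updateDigest(digest, includeLocalDeletionTime);
            return digest.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same logical delete as seen on two replicas, with different local clocks.
        Tombstone onNodeA = new Tombstone("col", 1000L, 1350000000);
        Tombstone onNodeB = new Tombstone("col", 1000L, 1350000007);

        System.out.println("digests match with local time:    "
                + Arrays.equals(digestOf(onNodeA, true), digestOf(onNodeB, true)));
        System.out.println("digests match without local time: "
                + Arrays.equals(digestOf(onNodeA, false), digestOf(onNodeB, false)));
    }
}
```

In this toy model the two replicas only agree on the digest once the node-local field is left out, which is the effect an updateDigest override in DeletedColumn would aim for.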
2) ExpiringColumns should not be used for merkle tree calculation if they are timed out.
I checked LazilyCompactedRow and saw that it does not exclude any timed-out columns. It loops over all columns and calls updateDigest on each of them, without any condition. Imho ExpiringColumn.updateDigest() should check its own isMarkedForDelete() first before making any changes to the digest. (We cannot simply call isMarkedForDelete() from LazilyCompactedRow, because we don't want this for DeletedColumns.)
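A rough sketch of what I mean in point 2. These are simplified stand-ins, not the real Cassandra classes: Column, ExpiringColumn and rowDigest here are hypothetical, and the current time is passed in explicitly as nowSeconds (an assumption made so the example is deterministic).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

public class ExpiringDigestSketch {
    /** Simplified regular column; NOT the real Cassandra Column class. */
    static class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }

        void updateDigest(MessageDigest digest, int nowSeconds) {
            digest.update(name.getBytes(StandardCharsets.UTF_8));
            digest.update(value.getBytes(StandardCharsets.UTF_8));
            digest.update(ByteBuffer.allocate(8).putLong(timestamp).array());
        }
    }

    /** Simplified column with a TTL; skips the digest once it has expired. */
    static final class ExpiringColumn extends Column {
        final int localExpirationTime; // seconds since epoch

        ExpiringColumn(String name, String value, long timestamp, int localExpirationTime) {
            super(name, value, timestamp);
            this.localExpirationTime = localExpirationTime;
        }

        boolean isMarkedForDelete(int nowSeconds) {
            return nowSeconds >= localExpirationTime;
        }

        @Override
        void updateDigest(MessageDigest digest, int nowSeconds) {
            if (isMarkedForDelete(nowSeconds)) {
                return; // expired: contribute nothing to the digest
            }
            super.updateDigest(digest, nowSeconds);
            digest.update(ByteBuffer.allocate(4).putInt(localExpirationTime).array());
        }
    }

    /** Loops over all columns unconditionally; the column itself decides whether to contribute. */
    static byte[] rowDigest(List<Column> columns, int nowSeconds) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            for (Column column : columns) {
                column.updateDigest(digest, nowSeconds);
            }
            return digest.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Column live = new Column("a", "1", 10L);
        Column expiring = new ExpiringColumn("b", "2", 10L, 100);
        int now = 200; // past the expiration time of column "b"
        boolean ignored = Arrays.equals(
                rowDigest(Arrays.asList(live, expiring), now),
                rowDigest(Arrays.asList(live), now));
        System.out.println("expired column ignored in digest: " + ignored);
    }
}
```

The check lives in ExpiringColumn itself, so the row-level loop stays unconditional and tombstones would still be digested as before.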
3) Cassandra should not create tombstones for expiring columns.
I am not 100% sure, but it looks to me like Cassandra creates tombstones for expired ExpiringColumns. This makes me wonder if we could delete expired columns directly. The digests for an ExpiringColumn and a DeletedColumn can never match, due to the different implementations. So there will always be a repair if compactions do not run synchronously on all nodes.
Imho it should be valid to delete ExpiringColumns directly, because the TTL is given by the client and should elapse on all nodes at the same time.
Taken together, these changes should reduce over-repair.
Of course, it is better to over-repair than to corrupt something.