Hi Sylvain,

Might I ask why repair cannot simply ignore anything that is older than gc-grace (as Aaron proposed)? I agree that repair should not process tombstones or the like. But in my mind it sounds reasonable to make repair ignore timed-out data. Because the timestamp is created on the client, there is no reason to repair these, right?

We are using TTLs quite heavily, and I noticed that every repair increases the load of all nodes by 1-2 GB, where each node has about 20-30 GB of data. I don't know whether this increases with the data volume. The data is mostly time-series data.
I even noticed an increase when running two repairs directly after each other. So even when data has just been repaired, there is still data being transferred. I assume this is due to some columns timing out within that timeframe and the entire row being repaired.

regards,
Christian

On Thu, Nov 1, 2012 at 9:43 AM, Sylvain Lebresne <sylvain@datastax.com> wrote:
> Is this a feature or a bug?

Neither, really. Repair doesn't do any gcable tombstone collection, and
it would be really hard to change that (besides, it's not its job). So
if, when you run repair, there are sstables with tombstones that could
be collected but have not been yet, then yes, they will be streamed. Now
the theory is that compaction will run often enough that gcable
tombstones will be collected in a reasonably timely fashion, so you
will never have lots of such tombstones in general (making the fact
that repair streams them largely irrelevant). That being said, in
practice, I don't doubt that there are a few scenarios like your own
where this can still lead to doing too much useless work.

I believe the main problem is that size-tiered compaction has a
tendency to not compact the largest sstables very often, meaning that
you could have large sstables consisting mostly of gcable tombstones
sitting around. In the upcoming Cassandra 1.2,
https://issues.apache.org/jira/browse/CASSANDRA-3442 will fix that.
Until then, if you are not afraid of a little bit of scripting, one
option could be to run a small script before each repair that checks
the creation time of your sstables. If an sstable is old enough (for
some value of "old enough" that depends on the TTL you use on all your
columns), you may want to force a compaction of that sstable (using the
JMX call forceUserDefinedCompaction()). The goal is to get rid of as
many outdated tombstones as possible before running the repair (you
could alternatively run a major compaction prior to the repair, but
major compactions have a lot of nasty effects, so I wouldn't recommend
that a priori).
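
To make the scripting idea a bit more concrete, here is a rough sketch
of the "find old sstables" half only. It is an illustration under some
assumptions, not a tested tool: the function name find_old_sstables is
made up, the file modification time is used as a stand-in for the
sstable creation time (sstables are not rewritten once flushed, so the
two should normally match), and data files are assumed to end in
"-Data.db". The age threshold is yours to pick; something like your
largest TTL plus gc_grace_seconds is one plausible choice. Feeding the
resulting files to forceUserDefinedCompaction() over JMX is left out.

```python
import os
import time


def find_old_sstables(data_dir, max_age_seconds, now=None):
    """Return paths of sstable data files in data_dir older than max_age_seconds.

    Assumption: a file's mtime approximates the sstable creation time.
    Assumption: sstable data files are the ones ending in "-Data.db".
    """
    if now is None:
        now = time.time()

    candidates = []
    for name in sorted(os.listdir(data_dir)):
        # Skip index, filter, statistics and other component files.
        if not name.endswith("-Data.db"):
            continue
        path = os.path.join(data_dir, name)
        age_seconds = now - os.path.getmtime(path)
        if age_seconds > max_age_seconds:
            candidates.append(path)
    return candidates


# Hypothetical usage: the directory and the numbers below are placeholders.
#   ttl = 7 * 24 * 3600          # largest TTL used on the column family
#   gc_grace = 10 * 24 * 3600    # gc_grace_seconds of the column family
#   for path in find_old_sstables("/path/to/keyspace/dir", ttl + gc_grace):
#       print(path)
```

The printed files would then be the candidates for a user-defined
compaction before the repair starts.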

--
Sylvain