incubator-cassandra-user mailing list archives

From Rob Coli <>
Subject Re: Question regarding the need to run nodetool repair
Date Fri, 16 Nov 2012 00:56:32 GMT
On Thu, Nov 15, 2012 at 4:12 PM, Dwight Smith
<> wrote:
> I have a 4 node cluster,  version 1.1.2, replication factor of 4, read/write
> consistency of 3, level compaction. Several questions.

Hinted Handoff is broken in your version [1] (and all versions between
1.0.0 and 1.0.3 [2]). Upgrade to 1.1.6 ASAP so that the answers below
actually apply, because working Hinted Handoff is involved.
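If it helps, you can confirm what each node is actually running before and after the upgrade with something like (host is a placeholder):

```shell
# Check the release version a node reports; nodetool ships
# with the Cassandra distribution.
nodetool -h localhost version
```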

> 1)      Should nodetool repair be run regularly to assure it has completed
> before gc_grace?  If it is not run, what are the exposures?

If you do DELETE logical operations, yes. If not, no. gc_grace_seconds
only applies to tombstones, and if you do not delete you have no
tombstones. If you only DELETE in one columnfamily, that is the only
one you have to repair within gc_grace.
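For what it's worth, repair can be scoped to a single keyspace and columnfamily; the names below are placeholders:

```shell
# Repair only the columnfamily that receives DELETEs, once
# per gc_grace_seconds window. MyKeyspace and MyCF are
# placeholder names, not anything from your schema.
nodetool -h localhost repair MyKeyspace MyCF
```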

Exposure is zombie data, where a node missed a DELETE (and associated
tombstone) but had a previous value for that column or row and this
zombie value is resurrected and propagated by read repair.

> 2)      If a node goes down, and is brought back up prior to the 1 hour
> hinted handoff expiration, should repair be run immediately?

In theory, if hinted handoff is working, no. This is a good thing
because otherwise simply restarting a node would trigger the need for
repair. In practice I would be shocked if anyone has scientifically
tested it to the degree required to be certain all edge cases are
covered, so I'm not sure I would rely on this being true. Especially
as key components of this guarantee such as Hinted Handoff can be
broken for 3-5 point releases before anyone notices.

It is because of this uncertainty that I recommend periodic repair
even in clusters that don't do DELETE.
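A minimal sketch of what that periodic repair might look like, assuming the default 10-day gc_grace_seconds and a hypothetical crontab entry:

```shell
# Run a weekly repair (Sunday 02:00), comfortably inside the
# default gc_grace_seconds of 864000 (10 days). Stagger the
# schedule per node so they do not all repair at once.
0 2 * * 0  nodetool -h localhost repair
```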

> 3)      If the hinted handoff has expired, the plan is to remove the node
> and start a fresh node in its place.  Does this approach cause problems?


1) You've lost any data that was only ever replicated to this node.
With RF>=3, this should be relatively rare, even with CL.ONE, because
writes are much more likely to succeed-but-report-they-failed than
vice versa. If you run periodic repair, you cover the case where
something gets under-replicated and then even less replicated as nodes
are replaced.
2) When you replace the node in place (presumably using
replace_token), you will only stream the relevant data from a single
other replica. This means that, given 3 nodes A B C where datum X is
on A and B, and B fails, it might be bootstrapped using C as a source,
decreasing your replica count of X by 1.

In order to deal with these issues, you need to run a repair of the
affected node after bootstrapping/replace_tokening. Until this repair
completes, CL.ONE reads might be stale or missing. I think what
operators really want is a path by which they can bootstrap and then
repair, before returning the node to the cluster. Unfortunately there
are significant technical reasons which prevent this from being possible.

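The sequence above, sketched as commands (the token value is a placeholder for the dead node's token):

```shell
# Start the replacement node claiming the dead node's token.
cassandra -Dcassandra.replace_token=85070591730234615865843651857942052864

# After bootstrap completes, repair the node; until this
# finishes, CL.ONE reads against it may be stale or missing.
nodetool -h localhost repair
```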
As such, I suggest increasing gc_grace_seconds and
max_hint_window_in_ms to reduce the amount of repair you need to run.
The negative to increasing gc_grace is that you store tombstones for
longer before purging them. The negative to increasing
max_hint_window_in_ms is that hints for a given token are stored in
one row, and very wide rows can exhibit pathological behavior.
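Concretely, the two knobs live in different places; a sketch (values are illustrative, not recommendations):

```shell
# cassandra.yaml -- hint window, in milliseconds
# (86400000 = 24 hours, versus the 1 hour default):
#   max_hint_window_in_ms: 86400000

# gc_grace is per-columnfamily, e.g. via cassandra-cli
# (1728000 seconds = 20 days, versus the 10 day default;
# MyCF is a placeholder name):
#   update column family MyCF with gc_grace = 1728000;
```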

Also if you set max_hint_window_in_ms too high, you could cause
cascading failure as nodes fill with hints, become less performant...
thereby increasing the cluster-wide hint rate. Unless you have a very
high write rate or really lazy ops people who leave nodes down for
very long times, the cascading failure case is relatively unlikely.



=Robert Coli
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb
