incubator-cassandra-user mailing list archives

From: Andrew Cooper <>
Subject: RE: Stalled streams during repairs
Date: Thu, 17 Apr 2014 03:39:49 GMT
First, thanks for the quick reply and JIRA links! It's helpful to know we are not the only
ones experiencing these issues.

"Are you sure you actually want/need to run repair as frequently as you currently are? Reducing
the frequency won't make it work any better, but it will reduce the number of times you have
to babysit its failure."

I think we are dealing with streaming issues specifically; we have been successfully running
repairs such that each node runs once a week in all of our clusters (to stay within gc_grace_seconds,
per best practices). In this particular case, we are trying to backfill data to a new second
datacenter from our first datacenter using manual repairs (total cluster load ~11TB). It
is becoming more and more evident that the most reliable option at this point would be to
do an out-of-band rsync of a snapshot on dc1, with a custom sstable id de-duplication script,
paired with a refresh/compaction/cleanup on the dc2 nodes as in [1]; a rough sketch of what
we have in mind is below. It should also be noted that our initial plan (nodetool rebuild)
failed on this cluster with a stack overrun, likely due to the massive number of CFs (2800+)
we are running (an admitted data model issue that is being worked out).
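Roughly, per CF and per pair of nodes, we are picturing something like the following (the
keyspace/CF names, host name, and data path are placeholders for our environment, and the
renaming step just stands in for the real de-duplication script):

# Rough sketch only, run from a dc2 node; in practice this has to be
# repeated per node and per CF, with the real de-dup logic in between.
KS=my_keyspace
CF=my_cf
DC1_NODE=dc1-node1
DATA_DIR=/var/lib/cassandra/data

# 1. Snapshot the CF on the dc1 node (hard links under .../snapshots/<tag>).
ssh "$DC1_NODE" nodetool snapshot "$KS" -cf "$CF" -t backfill

# 2. Pull the snapshot sstables into the matching CF directory on this dc2 node.
rsync -av "$DC1_NODE:$DATA_DIR/$KS/$CF/snapshots/backfill/" \
          "$DATA_DIR/$KS/$CF/"

# 3. Custom step: rename the copied sstables so their generation numbers don't
#    collide with files already on this node, e.g.
#    my_keyspace-my_cf-ic-42-Data.db -> my_keyspace-my_cf-ic-10042-Data.db
#    (plus the matching -Index.db, -Filter.db, etc.).

# 4. Load the new sstables, then compact and clean up locally.
nodetool refresh "$KS" "$CF"
nodetool compact "$KS" "$CF"
nodetool cleanup "$KS" "$CF"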

I would love to consider dropping scheduled anti-entropy repairs completely if we have enough
other fail-safes in place. We run RF=3 with LOCAL_QUORUM reads and writes. We also have read
repair chance set to 1.0 on most CFs (something we recently realized was carried over as a
default from the 0.8 days; this cluster is indeed that old...). Our usage sees deletes, but
worst case, if deleted data came back, I suspect it would just trigger duplicate processing.
We did notice our repair process timings went from about 8 hours in 1.1 to over 12 hours in 1.2.
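On the read repair chance point, dialing it back is just a per-table setting, so something
like the following should do it (keyspace/table name is a placeholder, and some of our
Thrift-era CFs may need cassandra-cli rather than cqlsh):

# 0.1 is the read_repair_chance default on newer versions;
# keyspace/table name is a placeholder.
echo "ALTER TABLE my_keyspace.my_cf WITH read_repair_chance = 0.1;" | cqlsh

# Thrift-era CFs that cqlsh can't alter may need cassandra-cli instead:
#   update column family my_cf with read_repair_chance = 0.1;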

Our biggest concern at this point is whether we can effectively rebuild a failed node with
streaming/bootstrap, or whether we need to devise custom workflows (like the rsync approach
mentioned above) to quickly and reliably bring a node back to full load. It sounds like there
are some considerable improvements to bootstrap/repair/streaming in 2.0, excluding the current
performance problems with vnodes.
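For context, the stock path we would like to be able to trust for a failed node is the
replacement bootstrap, roughly as below (the dead node's IP and the config path are
placeholders; if I recall correctly, replace_address only exists in newer 1.2.x builds and
older ones use replace_token):

# On the fresh replacement node (as root), before the first start;
# the dead node's IP and the config path are placeholders.
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.1.23"' \
    >> /etc/cassandra/cassandra-env.sh

# Start it; it should join by streaming the dead node's ranges from
# the surviving replicas rather than doing a normal token bootstrap.
service cassandra start

# Then watch the same streams that stall on us during repairs:
nodetool netstats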

We are planning on upgrading to 2.0, but as with most things, this won't happen overnight.
We obviously need to get to 1.2.16 as a prerequisite for the upgrade to 2.0, which will
probably get more priority now :)


