incubator-cassandra-user mailing list archives

From Oleg Dulin <oleg.du...@gmail.com>
Subject Frustration with "repair" process in 1.1.11
Date Fri, 01 Nov 2013 19:15:07 GMT
First I need to vent.

<rant>
One of my Cassandra clusters is a dual data center setup, with DC1 
acting as the primary and DC2 acting as a hot backup.

Well, guess what? I am pretty sure that it falls behind on 
replication. So I am told I need to run repair.

I run repair (with -pr) on DC2. The first time I run it, it gets 
*stuck* (i.e. frozen) within the first 30 seconds, with no error or 
message of any sort. I then run it again -- and it completes in 
seconds on each node, with about 50 GB of data on each.

That seems suspicious, so I do some research.

I am told on IRC that running repair -pr on a DC2 node will only 
repair a range of 100 tokens (the token offset from DC1 to DC2)… 
Seriously???

The repair process is, indeed, a joke: 
https://issues.apache.org/jira/browse/CASSANDRA-5396 . Repair is the 
worst thing you can do to your cluster: it consumes enormous 
resources and can leave your cluster in an inconsistent state. Oh, 
and by the way, you must run it every week… Whoever invented that 
process must not live in the real world, with real applications.
</rant>

No… let's have a constructive conversation.

How do I know, with certainty, that my DC2 cluster is up to date on 
replication? I have a few options:

1) I set read repair chance to 100% on critical column families and 
write a tool to scan every CF, every column of every row (a rough 
sketch of such a tool follows this list). This strikes me as very silly.
Q1: Do I need to scan every column, or is looking at one column 
enough to trigger a read repair?

2) Can someone explain to me how repair actually works, so that I 
don't totally trash my cluster or spill into the work week?
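
For option 1, the scan tool I have in mind would be roughly the 
following -- an untested sketch using pycassa, with the keyspace, 
column family, and host names as placeholders. As I understand it, 
reading at ConsistencyLevel.ALL compares all replicas (both DCs) on 
every read and repairs any mismatch in the slice that was read, 
regardless of the read repair chance setting -- the downside being 
that reads fail if any replica is down:

import pycassa

# Untested sketch; 'MyKeyspace', 'CriticalCF', and the host are placeholders.
pool = pycassa.ConnectionPool('MyKeyspace', server_list=['dc2-node1:9160'])
cf = pycassa.ColumnFamily(pool, 'CriticalCF')

rows = 0
for key, columns in cf.get_range(
        # re Q1: my understanding is that only the slice actually read is
        # reconciled, so column_count=1 would check just one column per row
        column_count=1,
        read_consistency_level=pycassa.ConsistencyLevel.ALL):
    rows += 1
print("scanned %d rows" % rows)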

Is there any improvement and clarity in 1.2? How about 2.0?



-- 
Regards,
Oleg Dulin
http://www.olegdulin.com