incubator-cassandra-user mailing list archives

From Wei Zhu <wz1...@yahoo.com>
Subject Re: (unofficial) Community Poll for Production Operators : Repair
Date Tue, 14 May 2013 16:10:30 GMT
1) 1.1.6 on 5 nodes, 24CPU, 72 RAM 
2) local quorum (we only have one DC though). We do delete through TTL 
3) yes 
4) Once a week, rolling repairs with -pr, run from a cron job 
5) It definitely has a negative impact on performance. Our data size is around 100G per
node, and a repair brings in an additional 60G - 80G of data and creates about 7K compaction
tasks (we use LCS with an SSTable size of 10M, which was a mistake we made at the beginning). It takes
more than a day for the compaction tasks to clear, and by then the next compaction starts.
We had to set a client-side (Hector) timeout to deal with it, and the SLA is still under control
for now. 
But we had to halt the go-live for another cluster because of the unanticipated "doubling" of disk
space during the repair. 
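
A rough back-of-the-envelope check of the figures above, as a sketch only. The function names are illustrative, the units are assumed to be decimal (G = GB, M = MB), and the 160 MB SSTable size is a hypothetical comparison point, not a value from this thread:

```python
def peak_disk_gb(steady_gb, repair_extra_gb):
    """Peak per-node disk usage while repair-streamed data awaits compaction."""
    return steady_gb + repair_extra_gb


def sstable_count(data_gb, sstable_mb):
    """Approximate number of SSTables at a fixed SSTable size (1 GB = 1000 MB)."""
    return int(data_gb * 1000 / sstable_mb)


# Figures from the message: ~100G per node, plus 60G - 80G streamed in by repair.
print("peak during repair (GB):", peak_disk_gb(100, 60), "to", peak_disk_gb(100, 80))

# 10M SSTables (the setting described above) versus a hypothetical larger size.
print("SSTables at 10 MB:", sstable_count(100, 10))
print("SSTables at 160 MB (hypothetical):", sstable_count(100, 160))
```

With SSTables this small, a node holds on the order of ten thousand files, which is consistent with a repair queuing thousands of compaction tasks.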

Regarding Dean's question about simulating a slow response: someone on IRC mentioned a trick of
starting Cassandra in the foreground with -f and then suspending it with Ctrl-Z, and it works for our tests. 
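
The same effect can probably be scripted instead of typed at a terminal. This is a minimal sketch, assuming a POSIX system: Ctrl-Z delivers SIGTSTP to the foreground process, while the sketch uses SIGSTOP, which freezes the process in a similar way but cannot be caught. `pause_process` and the pid passed to it are hypothetical names, not part of Cassandra:

```python
import os
import signal
import time


def pause_process(pid, seconds):
    """Freeze a process with SIGSTOP, wait, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)  # the process stops running; its sockets stay open
    try:
        time.sleep(seconds)
    finally:
        os.kill(pid, signal.SIGCONT)  # always resume, even if the sleep is interrupted


# Example (hypothetical pid of a test node):
# pause_process(12345, 30)
```

While the process is frozen, the node still accepts TCP connections at the kernel level but never answers, which is what makes it look slow rather than down.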

-Wei 
----- Original Message -----

From: "Dean Hiller" <Dean.Hiller@nrel.gov> 
To: user@cassandra.apache.org 
Sent: Tuesday, May 14, 2013 4:48:02 AM 
Subject: Re: (unofficial) Community Poll for Production Operators : Repair 

We had to roll out a fix in Cassandra because, for some reason, a slow node was slowing down our
Cassandra clients in 1.2.2. Every time we had a slow node, we found out fast because performance
degraded. We tested this in QA and saw the same issue. This means a repair made that node
slow, which in turn made our clients slow. With this fix, which I think one of our team is going to try
to get back into Cassandra, the slow node no longer affects our clients. 

I am curious, though: if someone else used the "tc" program to simulate Linux packet delay
on a single node, would your clients' response times get much slower? We simulated a 500ms delay
on the node to simulate the slow node... it seems the coordinator node was incorrectly waiting
for BOTH responses on CL_QUORUM instead of just one (as it was itself one as well), or something
like that. (I don't know too much, as my colleague was the one who debugged this issue.) 
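
To make the expected behavior concrete, here is a small model of it. It is a sketch only: it assumes a replication factor of 3 with the coordinator holding one replica itself (inferred from the description above, not stated in it), and the latencies are made-up numbers apart from the 500ms delay:

```python
def quorum(replication_factor):
    """Number of replica responses needed for QUORUM."""
    return replication_factor // 2 + 1


def response_time_ms(replica_latencies_ms, required):
    """Time until `required` replicas have answered: the required-th fastest latency."""
    return sorted(replica_latencies_ms)[required - 1]


# Hypothetical latencies: the coordinator's own copy, a healthy peer,
# and a peer delayed by 500 ms.
latencies = [0, 5, 500]
rf = len(latencies)

print(response_time_ms(latencies, quorum(rf)))  # waits for 2 of 3 -> 5
print(response_time_ms(latencies, rf))          # waits for all 3  -> 500
```

Under these assumptions a quorum request should complete as soon as the healthy peer answers, so a client only sees the 500ms delay if the coordinator waits for every replica.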

Dean 

From: Alain RODRIGUEZ <arodrime@gmail.com> 
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, May 14, 2013 1:42 AM 
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: (unofficial) Community Poll for Production Operators : Repair 

Hi Rob, 

1) 1.2.2 on 6 to 12 EC2 m1.xlarge 
2) Quorum for reads and writes. Almost no deletes (just some TTLs) 
3) Yes 
4) On each node once a week (rolling repairs using crontab) 
5) The only behavior that is odd or unexplained to me is why a repair doesn't fix a
counter mismatch between 2 nodes. When I read my counters with CL.One I get inconsistent results
(the counter value may change every time I read it, depending, I guess, on which node I read from).
Reading with CL.Quorum avoids the inconsistency, but the data is still wrong on some nodes. As for
performance, a repair is quite expensive to run, but doing it during a low-load period and in a rolling
fashion works quite well and has no impact on the service. 
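
One way to stagger weekly rolling repairs across nodes can be sketched as follows. This is an illustration, not the setup used here: `weekly_repair_crontab` and the node names are hypothetical, the slots are simply spread evenly over the week (so they ignore low-load windows), and the -pr flag is taken from Wei's reply in this thread. Each generated line would go into the crontab of its own node:

```python
def weekly_repair_crontab(nodes):
    """Return one crontab line per node, spaced evenly across the 168 hours of a week."""
    hours_in_week = 7 * 24
    if not nodes or len(nodes) > hours_in_week:
        raise ValueError("need between 1 and 168 nodes")
    step = hours_in_week // len(nodes)  # hours between consecutive repair starts
    entries = {}
    for i, node in enumerate(nodes):
        day, hour = divmod(i * step, 24)  # day of week (0 = Sunday), hour of day
        # crontab fields: minute hour day-of-month month day-of-week command
        entries[node] = f"0 {hour} * * {day} nodetool repair -pr"
    return entries


for node, line in weekly_repair_crontab([f"node{i}" for i in range(1, 7)]).items():
    print(node, "->", line)
```

With six nodes the starts land 28 hours apart, so no two nodes begin a repair at the same time. Whether repairs overlap still depends on how long each one runs.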

Hope this will help somehow. Let me know if you need more information. 

Alain 



2013/5/10 Robert Coli <rcoli@eventbrite.com> 
Hi! 

I have been wondering how Repair is actually used by operators. If 
people operating Cassandra in production could answer the following 
questions, I would greatly appreciate it. 

1) What version of Cassandra do you run, on what hardware? 
2) What consistency level do you write at? Do you do DELETEs? 
3) Do you run a regularly scheduled repair? 
4) If you answered "yes" to 3, what is the frequency of the repair? 
5) What has been your subjective experience with the performance of 
repair? (Does it work as you would expect? Does its overhead have a 
significant impact on the performance of your cluster?) 

Thanks! 

=Rob 


