@Michal: all true. A cleanup would certainly remove a lot of useless data there, and I also advise Evan to run it. However, Evan may want to keep repairing his cluster as a routine operation, and there is no reason an RF change shouldn't lead to this kind of issue.

@Evan: With this amount of data, and since you are not on C* 1.2, you could try tuning your bloom filters to use less memory. You could even disable them for as long as it takes to recover from this issue: set bloom_filter_fp_chance = 1.0, then upgrade the sstables and retry the repair.

This depends a lot on your needs and your context, but it might work if you can afford it.

By the way, C* prior to 1.2 should not exceed 300-500 GB per node. I read once that C* 1.2 aims to reach 3-5 TB per node. Yet horizontal scaling over a peer-to-peer architecture is one of the main points of Cassandra. You may want to be careful and scale out when needed, so that you never reach that much data per node.

As always, experts/committers, please correct me if I am wrong.

Alain

2013/7/4 Michał Michalski <firstname.lastname@example.org>
I don't think you need to run repair if you decrease RF. At least I wouldn't do it.
When *decreasing* RF you have 3 nodes containing some data, but only 2 of them should store it from now on, so you should run cleanup rather than repair to get rid of the data on the 3rd replica. And I guess it should work (in terms of disk space and memory) if you've been able to perform compaction.
Repair makes sense if you *increase* RF, so the data are streamed to the new replicas.
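To illustrate the point about cleanup, here is a minimal sketch of ring-order replica placement. It assumes a SimpleStrategy-style layout, where the first replica is the node whose token is at or after the key's token and the rest follow clockwise. The 4-node ring, its tokens and the replicas() helper are hypothetical, made up for this example. This is not Cassandra code.

```python
from bisect import bisect_left

def replicas(ring, key_token, rf):
    """Return the rf nodes owning key_token: the first node whose token
    is >= key_token, then the next rf - 1 nodes clockwise on the ring."""
    tokens = sorted(ring)
    # Wrap around to the first node when key_token is past the last token.
    start = bisect_left(tokens, key_token) % len(tokens)
    return [ring[tokens[(start + i) % len(tokens)]] for i in range(rf)]

# Hypothetical 4-node ring: token -> node name.
ring = {0: "A", 25: "B", 50: "C", 75: "D"}

before = replicas(ring, 30, rf=3)  # owners of token 30 at RF=3
after = replicas(ring, 30, rf=2)   # owners of token 30 at RF=2
stale = [n for n in before if n not in after]

print(before, after, stale)  # ['C', 'D', 'A'] ['C', 'D'] ['A']
```

In this toy ring, node A still holds a copy it no longer owns after the RF change. That leftover copy is what cleanup removes, whereas repair only synchronizes the nodes that remain replicas.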
On 04.07.2013 12:20, Evan Dandrea wrote:
We've made the mistake of letting our nodes get too large, now holding
about 3TB each. We ran out of enough free space to have a successful
compaction, and because we're on 1.0.7, enabling compression to get
out of the mess wasn't feasible. We tried adding another node, but we
think this may have put too much pressure on the existing ones it was
replicating from, so we backed out.
So we decided to drop RF down to 2 from 3 to relieve the disk pressure
and started building a secondary cluster with lots of 1 TB nodes. We
ran repair -pr on each node, but it’s failing with a JVM OOM on one
node while another node is streaming from it for the final repair.
Does anyone know what we can tune to get the cluster stable enough to
put it in a multi-dc setup with the secondary cluster? Do we actually
need to wait for these RF3->RF2 repairs to stabilize, or could we
point it at the secondary cluster without worry of data loss?
We’ve set the heap on these two problematic nodes to 20GB, up from the
equally too high 12GB, but we’re still hitting OOM. I had seen in
other threads that tuning down compaction might help, so we’re trying:
in_memory_compaction_limit_in_mb 32 (down from 64)
compaction_throughput_mb_per_sec 8 (down from 16)
concurrent_compactors 2 (the nodes have 24 cores)
flush_largest_memtables_at 0.45 (down from 0.50)
stream_throughput_outbound_megabits_per_sec 300 (down from 400)
reduce_cache_sizes_at 0.5 (down from 0.6)
reduce_cache_capacity_to 0.35 (down from 0.4)
Here’s the log from the most recent repair failure:
The OOM starts at line 13401.
Thanks for whatever insight you can provide.