cassandra-user mailing list archives

From Eric Stevens
Subject Re: going down from RF=3 to RF=2, repair constantly falls over with JVM OOM
Date Fri, 05 Jul 2013 20:02:02 GMT
The following setting is probably not a good idea:
bloom_filter_fp_chance = 1.0

It would disable the bloom filters altogether, and this setting doesn't
have appreciably greater benefits over a setting of 0.1 (which has the
advantage of saving you from disk I/O 90% of the time for keys which don't
exist).
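
To put rough numbers on the 1.0 vs 0.1 trade-off, here is a minimal Python
sketch based on the textbook sizing formula for an optimally sized Bloom
filter. It is an approximation only: Cassandra's own sizing may differ, and
the key count below is a made-up figure for illustration.

```python
import math


def bloom_bits_per_key(fp_chance: float) -> float:
    """Bits per key for an optimally sized Bloom filter (textbook formula)."""
    if not 0.0 < fp_chance <= 1.0:
        raise ValueError("fp_chance must be in (0, 1]")
    return math.log(1.0 / fp_chance) / (math.log(2) ** 2)


def reads_avoided_for_absent_keys(fp_chance: float) -> float:
    """Fraction of lookups for absent keys rejected without touching disk."""
    return 1.0 - fp_chance


def filter_size_mib(num_keys: int, fp_chance: float) -> float:
    """Approximate filter size in MiB for num_keys keys."""
    return num_keys * bloom_bits_per_key(fp_chance) / 8 / 2**20


if __name__ == "__main__":
    keys = 1_000_000_000  # hypothetical key count, not taken from this thread
    for p in (0.01, 0.1, 1.0):
        print(
            f"fp_chance={p}: "
            f"{bloom_bits_per_key(p):.2f} bits/key, "
            f"~{filter_size_mib(keys, p):.0f} MiB, "
            f"{reads_avoided_for_absent_keys(p):.0%} of absent-key reads avoided"
        )
```

Under this formula a setting of 0.1 costs roughly 4.8 bits per key, about
half of what 0.01 costs, while still skipping most reads for absent keys. A
setting of 1.0 costs nothing but skips none of them.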


On Thu, Jul 4, 2013 at 8:32 AM, Alain RODRIGUEZ wrote:

> @Michal: all true, a cleanup would certainly remove a lot of useless data
> there, and I also advise Evan to do it. However, Evan may want to continue
> repairing his cluster as a routine operation, and there is no reason an RF
> change should lead to this kind of issue.
> @Evan: with this amount of data, and since you are not using C* 1.2, you
> could try tuning your bloom filters to use less memory. For example,
> disable them for as long as it takes to recover from this issue: set
> bloom_filter_fp_chance = 1.0, then upgrade sstables and retry the repair.
> This depends a lot on your needs and your context, but it might work if
> you can afford it.
> By the way, C* prior to 1.2 should not exceed 300-500 GB per node. I read
> once that C* 1.2 aims to reach 3-5 TB per node. Still, horizontal scaling
> with a peer-to-peer architecture is one of the main points of Cassandra.
> You might want to be careful and scale out when needed, so that you never
> reach that much data per node.
> As always, experts/committers, please correct me if I am wrong.
> Alain
> 2013/7/4 Michał Michalski
>> I don't think you need to run repair if you decrease RF. At least I
>> wouldn't do it.
>> In case of *decreasing* RF you have 3 nodes containing some data, but only 2
>> of them should store it from now on, so you should run cleanup
>> instead of repair, to get rid of the data on the 3rd replica. And I guess it
>> should work (in terms of disk space and memory), if you've been able to
>> perform compaction.
>> Repair makes sense if you *increase* RF, so the data are streamed to the
>> new replicas.
>> M.
>> On 04.07.2013 12:20, Evan Dandrea wrote:
>>> Hi,
>>> We've made the mistake of letting our nodes get too large, now holding
>>> about 3TB each. We ran out of enough free space to have a successful
>>> compaction, and because we're on 1.0.7, enabling compression to get
>>> out of the mess wasn't feasible. We tried adding another node, but we
>>> think this may have put too much pressure on the existing ones it was
>>> replicating from, so we backed out.
>>> So we decided to drop RF down to 2 from 3 to relieve the disk pressure
>>> and started building a secondary cluster with lots of 1 TB nodes. We
>>> ran repair -pr on each node, but it’s failing with a JVM OOM on one
>>> node while another node is streaming from it for the final repair.
>>> Does anyone know what we can tune to get the cluster stable enough to
>>> put it in a multi-dc setup with the secondary cluster? Do we actually
>>> need to wait for these RF3->RF2 repairs to stabilize, or could we
>>> point it at the secondary cluster without worry of data loss?
>>> We’ve set the heap on these two problematic nodes to 20GB, up from the
>>> equally too high 12GB, but we’re still hitting OOM. I had seen in
>>> other threads that tuning down compaction might help, so we’re trying
>>> the following:
>>> in_memory_compaction_limit_in_mb 32 (down from 64)
>>> compaction_throughput_mb_per_sec 8 (down from 16)
>>> concurrent_compactors 2 (the nodes have 24 cores)
>>> flush_largest_memtables_at 0.45 (down from 0.50)
>>> stream_throughput_outbound_megabits_per_sec 300 (down from 400)
>>> reduce_cache_sizes_at 0.5 (down from 0.6)
>>> reduce_cache_capacity_to 0.35 (down from 0.4)
>>> -XX:CMSInitiatingOccupancyFraction=30
>>> Here’s the log from the most recent repair failure:
>>>**5843017/ <>
>>> The OOM starts at line 13401.
>>> Thanks for whatever insight you can provide.
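
The cleanup-versus-repair point made earlier in the thread can be sketched
with a toy ring. This is a simplified model, assuming SimpleStrategy-like
placement (the primary plus the next RF-1 nodes clockwise) and ignoring data
centers and racks. The node names A to D are hypothetical, and the code is
not Cassandra's own placement logic.

```python
def replicas(ring, primary_index, rf):
    """Primary node plus the next rf-1 nodes clockwise on the ring."""
    return [ring[(primary_index + i) % len(ring)] for i in range(rf)]


def placement_diff(ring, old_rf, new_rf):
    """Return (to_drop, to_receive) for an RF change.

    to_drop[node]    = primary ranges the node holds but no longer owns
                       (the data that cleanup is meant to remove).
    to_receive[node] = primary ranges the node now owns but does not hold
                       (the data that repair is meant to stream in).
    """
    to_drop = {node: [] for node in ring}
    to_receive = {node: [] for node in ring}
    for i, primary in enumerate(ring):
        before = replicas(ring, i, old_rf)
        after = replicas(ring, i, new_rf)
        for node in before:
            if node not in after:
                to_drop[node].append(primary)
        for node in after:
            if node not in before:
                to_receive[node].append(primary)
    return to_drop, to_receive


if __name__ == "__main__":
    ring = ["A", "B", "C", "D"]  # hypothetical 4-node ring

    drop, receive = placement_diff(ring, old_rf=3, new_rf=2)
    print("RF 3 -> 2, surplus per node:", drop)
    print("RF 3 -> 2, missing per node:", receive)

    drop, receive = placement_diff(ring, old_rf=2, new_rf=3)
    print("RF 2 -> 3, surplus per node:", drop)
    print("RF 2 -> 3, missing per node:", receive)
```

In this model, going from RF=3 to RF=2 leaves every node with one range it
no longer owns and nothing missing, which is the case for cleanup. Going
from RF=2 to RF=3 leaves every node missing one range, which is the case
for repair.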
