cassandra-user mailing list archives

From Alain RODRIGUEZ <>
Subject Re: going down from RF=3 to RF=2, repair constantly falls over with JVM OOM
Date Thu, 04 Jul 2013 12:32:04 GMT
@Michal: all true, a cleanup would certainly remove a lot of useless data
there, and I also advise Evan to run it. However, Evan may want to keep
repairing his cluster as a routine operation, and there is no reason an RF
change should lead to this kind of issue.

@Evan: With this amount of data, and since you are not on C* 1.2, you could
try tuning your bloom filters to use less memory. For example, disable them
for as long as it takes to recover from this issue: set
bloom_filter_fp_chance = 1.0, then upgrade sstables and retry the repair.

This depends a lot on your needs and your context, but it might work if you
can afford it.
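
A rough sketch of why raising bloom_filter_fp_chance frees memory, assuming
the textbook sizing formula for an optimally sized Bloom filter (about
ln(1/p) / (ln 2)^2 bits per key). Cassandra's real implementation rounds
differently, and the key count below is a made-up figure, not a measurement
from Evan's cluster:

```python
import math


def bloom_bits_per_key(fp_chance: float) -> float:
    """Approximate bits per key for an optimally sized Bloom filter.

    Textbook formula ln(1/p) / (ln 2)^2; only an estimate of what
    Cassandra actually allocates.
    """
    if not 0.0 < fp_chance <= 1.0:
        raise ValueError("fp_chance must be in (0, 1]")
    return math.log(1.0 / fp_chance) / (math.log(2) ** 2)


def bloom_memory_gb(num_keys: int, fp_chance: float) -> float:
    """Estimated Bloom filter size in GB for num_keys row keys."""
    return num_keys * bloom_bits_per_key(fp_chance) / 8 / 1024 ** 3


if __name__ == "__main__":
    hypothetical_keys = 2_000_000_000  # assumption, not taken from the thread
    for p in (0.000744, 0.01, 0.1, 1.0):
        size = bloom_memory_gb(hypothetical_keys, p)
        print(f"fp_chance={p}: ~{size:.2f} GB")
```

At fp_chance = 1.0 the estimate drops to zero, which is the "disable them"
case above. The price is that every read has to check every candidate
sstable on disk.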

By the way, C* prior to 1.2 should not exceed 300-500 GB per node. I read once
that C* 1.2 aims to reach 3-5 TB per node. Still, horizontal scaling over a
peer-to-peer architecture is one of the main points of Cassandra. You may want
to be careful and scale out when needed, so that you never reach that much
data per node.
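
To make the per-node guideline concrete, here is a back-of-the-envelope
sizing sketch. It assumes data is spread evenly across nodes and ignores
compaction headroom and compression. The function name and the 6 TB figure
are hypothetical, chosen only for illustration:

```python
import math


def nodes_needed(unique_data_gb: float, rf: int, max_per_node_gb: float) -> int:
    """Minimum node count so that each node stays under max_per_node_gb.

    Assumes perfectly even distribution; a cluster also needs at least
    rf nodes to hold rf replicas of each row.
    """
    if unique_data_gb < 0 or rf < 1 or max_per_node_gb <= 0:
        raise ValueError("invalid sizing parameters")
    raw_gb = unique_data_gb * rf  # total on-disk data across all replicas
    return max(rf, math.ceil(raw_gb / max_per_node_gb))


if __name__ == "__main__":
    # Hypothetical: 6 TB of unique data at RF=2, capped at 500 GB per node.
    print(nodes_needed(6000, rf=2, max_per_node_gb=500))
```

In practice you would leave extra free space on each node, since compaction
needs room to rewrite sstables.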

As always, experts/committers, please correct me if I am wrong.


2013/7/4 Michał Michalski <>

> I don't think you need to run repair if you decrease RF. At least I
> wouldn't do it.
> In case of *decreasing* RF you have 3 nodes containing some data, but only 2
> of them should store it from now on, so you should rather run cleanup,
> instead of repair, to get rid of the data on the 3rd replica. And I guess it
> should work (in terms of disk space and memory), if you've been able to
> perform compaction.
> Repair makes sense if you *increase* RF, so the data are streamed to the
> new replicas.
> M.
> On 04.07.2013 12:20, Evan Dandrea wrote:
>> Hi,
>> We've made the mistake of letting our nodes get too large, now holding
>> about 3TB each. We ran out of enough free space to have a successful
>> compaction, and because we're on 1.0.7, enabling compression to get
>> out of the mess wasn't feasible. We tried adding another node, but we
>> think this may have put too much pressure on the existing ones it was
>> replicating from, so we backed out.
>> So we decided to drop RF down to 2 from 3 to relieve the disk pressure
>> and started building a secondary cluster with lots of 1 TB nodes. We
>> ran repair -pr on each node, but it’s failing with a JVM OOM on one
>> node while another node is streaming from it for the final repair.
>> Does anyone know what we can tune to get the cluster stable enough to
>> put it in a multi-dc setup with the secondary cluster? Do we actually
>> need to wait for these RF3->RF2 repairs to stabilize, or could we
>> point it at the secondary cluster without worry of data loss?
>> We’ve set the heap on these two problematic nodes to 20GB, up from the
>> equally too high 12GB, but we’re still hitting OOM. I had seen in
>> other threads that tuning down compaction might help, so we’re trying
>> the following:
>> in_memory_compaction_limit_in_mb 32 (down from 64)
>> compaction_throughput_mb_per_sec 8 (down from 16)
>> concurrent_compactors 2 (the nodes have 24 cores)
>> flush_largest_memtables_at 0.45 (down from 0.50)
>> stream_throughput_outbound_megabits_per_sec 300 (down from 400)
>> reduce_cache_sizes_at 0.5 (down from 0.6)
>> reduce_cache_capacity_to 0.35 (down from 0.4)
>> -XX:CMSInitiatingOccupancyFraction=30
>> Here’s the log from the most recent repair failure:
>>**5843017/ <>
>> The OOM starts at line 13401.
>> Thanks for whatever insight you can provide.
