cassandra-user mailing list archives

From "Thakrar, Jayesh" <>
Subject Re: repair performance
Date Mon, 20 Mar 2017 13:38:58 GMT
Another thing.
Based on what I see in our system, especially when I was changing from STCS to LCS, compaction
does cause quite a bit of memory churn, and it helps to increase heap memory to a certain extent.
You can see heap sizes using nodetool info to gauge your usage and high-water mark.
Enabling GC logging helps as well to see the impact.
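For example (just a sketch; exact output fields and file locations vary by version):

  nodetool info | grep -i "heap memory"
  # Heap Memory (MB)  : 3250.55 / 8192.00    <- used / max, numbers purely illustrative

Heap size is typically raised via MAX_HEAP_SIZE in conf/cassandra-env.sh (or -Xms/-Xmx in conf/jvm.options on 3.x), and GC logging can be enabled by uncommenting the GC logging options in those same files.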

From: Roland Otta <>
Date: Monday, March 20, 2017 at 1:53 AM
To: Conversant <>, "" <>
Subject: Re: repair performance

Good point! I did not (so far). I will do that - especially because I often see all compaction
threads being used during repair (according to compactionstats).

Thank you also for your link recommendations. I will go through them.

On Sat, 2017-03-18 at 16:54 +0000, Thakrar, Jayesh wrote:
You changed compaction_throughput_mb_per_sec, but did you also increase concurrent_compactors?
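If it helps, a minimal sketch under default assumptions (the value 8 is purely illustrative; the real default is derived from the number of disks and cores, and changing it requires a restart):

  # cassandra.yaml
  concurrent_compactors: 8               # illustrative value
  compaction_throughput_mb_per_sec: 0    # 0 = unthrottled

  # compaction throughput can also be changed on a running node
  nodetool setcompactionthroughput 0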

In reference to the reaper and some other info I received on the user forum in response to my question
on "nodetool repair", here are some useful links/slides -

From: Roland Otta <>
Date: Friday, March 17, 2017 at 5:47 PM
To: "" <>
Subject: Re: repair performance

Did not notice that so far.

Thank you for the hint. I will definitely give it a try.

On Fri, 2017-03-17 at 22:32 +0100, benjamin roth wrote:
The fork from thelastpickle is compatible. I'd recommend giving it a try over pure nodetool.

2017-03-17 22:30 GMT+01:00 Roland Otta <>:

Forgot to mention the version we are using:

We are using 3.0.7 - so I guess we should have incremental repairs by default.
It also prints out incremental: true when starting a repair:
INFO  [Thread-7281] 2017-03-17 09:40:32,059 - Starting repair command
#7, repairing keyspace xxx with repair options (parallelism: parallel, primary range: false,
incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [ProdDC2], hosts: [],
# of ranges: 1758)
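(Just for completeness, a sketch - the keyspace name is a placeholder: a full repair could still be forced explicitly with

  nodetool repair -full -local my_keyspace

if we ever wanted to bypass the incremental default.)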

3.0.7 is also the reason why we are not using reaper ... as far as I could figure out, it's
not compatible with 3.0+.

On Fri, 2017-03-17 at 22:13 +0100, benjamin roth wrote:
It depends a lot ...

- Repairs can be very slow, yes! (And unreliable, due to timeouts, outages, whatever)
- You can use incremental repairs to speed things up for regular repairs
- You can use "reaper" to schedule repairs and run them sliced, automated, and failsafe (a sketch of manual slicing follows below)
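To illustrate the "sliced" part (only a sketch - token values and keyspace name are made up): reaper splits the token ring into many small subranges and repairs them one at a time, which plain nodetool can also do via -st/-et:

  # repair a single small token subrange instead of the node's whole range
  nodetool repair -st -9223372036854775808 -et -9100000000000000000 my_keyspace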

The time repairs actually take may vary a lot depending on how much data has to be streamed and
how inconsistent your cluster is.

50 Mbit/s is really a bit low! The actual performance depends on many factors, like your
CPU, RAM, HDD/SSD, concurrency settings, and the load of the "old nodes" of the cluster.
This is quite an individual problem you have to track down individually.

2017-03-17 22:07 GMT+01:00 Roland Otta <>:


We are quite inexperienced with Cassandra at the moment and are playing
around with a new cluster we built to get familiar with
Cassandra and its possibilities.

While getting familiar with that topic, we noticed that repairs in
our cluster take a long time. To give an idea of our current setup, here
are some numbers:

Our cluster currently consists of 4 nodes (replication factor 3).
These nodes are all on dedicated physical hardware in our own
datacenter. All of the nodes have:

32 cores @ 2.9 GHz
64 GB RAM
2 SSDs (RAID 0), 900 GB each, for data
1 separate HDD for OS + commitlogs

Current dataset:
approx. 530 GB per node
21 tables (the biggest one has more than 200 GB / node)

I already tried setting compaction throughput + streaming throughput to
unlimited for testing purposes ... but that did not change anything.
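By "unlimited" I mean roughly the following (0 disables the throttle; both take effect immediately on the running node):

  nodetool setcompactionthroughput 0   # disable compaction throttling
  nodetool setstreamthroughput 0       # disable the streaming throttle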

When checking system resources, I cannot see any bottleneck (CPUs are
pretty idle and we have no iowait).

When issuing a repair via

nodetool repair -local

on a node, the repair takes longer than a day.
Is this normal, or could we normally expect a faster repair?

I also noticed that initializing new nodes in the datacenter was
really slow (approx. 50 Mbit/s). Here too I expected much better
performance - could those two problems be somehow related?

