Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
I would recommend having at most 300GB to 400GB per node on a regular HDD with 1 Gbit networking.
But on the 3rd node, we suspect major compaction didn't actually finish its job…
The file list looks odd. Check the timestamps on the files. You should not have files older than the time compaction started.
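A minimal sketch of that timestamp check, assuming you noted when nodetool compact was launched. The function name files_older_than is made up for illustration, and file modification time is only a proxy for when an SSTable was written.

```python
from datetime import datetime
from pathlib import Path


def files_older_than(data_dir, compaction_start, pattern="*-Data.db"):
    """Return names of matching files last modified before compaction_start."""
    cutoff = compaction_start.timestamp()
    return sorted(
        path.name
        for path in Path(data_dir).glob(pattern)
        if path.stat().st_mtime < cutoff
    )


# Example call (the start time here is a placeholder, not taken from the thread):
# files_older_than("/data_bst/cassandra/data/ATLAS/Data", datetime(2012, 12, 1, 9, 0))
```

Any file it returns was last written before the compaction began, so it was not rewritten by that run.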
8GB heap
The default is 4GB max nowadays.
1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below?
I cannot answer that.
2) Should we restart with leveled compaction next year?
I would run some tests to see how it works for your workload.
4) Should we consider increasing the cluster capacity?
IMHO yes.
You may also want to do some experiments with turning compression on if it is not already enabled.
Having so much data on each node is a potential bad day. If you had to move or repair one of those nodes, how long would it take for Cassandra to stream all the data over? (Or to rsync the data over.) How long does it take to run nodetool repair on the node?
With RF 2, if you lose a node you have lost your redundancy. It's important to have a plan for how to get it back and to know how long that may take.
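To put a rough number on that question, here is a back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes), a dedicated 1 Gbit/s link, and an "efficiency" factor that is purely hypothetical; real streaming and repair are usually much slower than line rate.

```python
def transfer_hours(data_bytes, link_bits_per_sec, efficiency=1.0):
    """Hours needed to move data_bytes over a link at the given efficiency (0..1]."""
    bytes_per_sec = link_bits_per_sec / 8 * efficiency
    return data_bytes / bytes_per_sec / 3600


TB = 10**12
GBIT = 10**9

# Best case: 1.4 TB at full line rate.
print(f"{transfer_hours(1.4 * TB, GBIT):.1f} hours at line rate")
# Hypothetical case: only a quarter of the link is actually achieved.
print(f"{transfer_hours(1.4 * TB, GBIT, efficiency=0.25):.1f} hours at 25% of line rate")
```

Even the best case is about three hours for one node, and that is before any compaction or validation work on the receiving side.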
Hope that helps.
-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
Hi guys,
Sorry for the late follow-up but I waited to run major compactions on all 3 nodes at a time before replying with my findings.
Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
But on the 3rd node, we suspect major compaction didn't actually finish its job. First of all, nodetool compact returned much earlier than on the other nodes: after 1 day and 15 hours. Secondly, of the 1.4TB initially on the node only about 36GB was freed up (almost the same size as before). We saw nothing in the server log (debug not enabled). Below I have pasted some more details about file sizes before and after compaction on this third node, and about disk occupancy.
The situation is maybe not so dramatic for us because in less than 2 weeks we will have downtime until after the new year. During this time we can completely delete all the data in the cluster and start fresh with TTLs of 1 month (as suggested by Aaron) and an 8GB heap (as suggested by Alain - thanks).
Questions:
1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below?
[Note: we expect the minor compactions to continue building up files but never really get to compacting the large file, and thus not to need much temporary extra disk space].
2) Should we restart with leveled compaction next year?
[Note: Aaron was right. We have 1-week rows which get deleted after 1 month, which means older rows end up in the big files. To free up space with SizeTiered we will have no choice but to run major compactions, and we don't know whether they will work given that we reach ~1TB per node per month. You can see we are at the limit!]
3) In case we keep SizeTiered:
- How can we improve the performance of our major compactions? (We left all config parameters at their defaults.) Would increasing compaction throughput interfere with writes and reads? What about multi-threaded compactions?
- Do we still need to run regular repair operations as well? Do these also do a major compaction or are they completely separate operations?
[Note: we have 3 nodes with RF=2, inserting at consistency level ONE and reading at consistency level ALL. We read primarily for exports: we export 1 week's worth of data at a time].
4) Should we consider increasing the cluster capacity?
[We generate ~5 million new rows every week, which shouldn't come close to the hundreds of millions of rows per node mentioned by Aaron as the volume that would create problems with bloom filters and indexes].
Cheers,
Alex
------------------
The situation in the data folder
before calling nodetool compact:
du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
444G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
376G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
305G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-68959-Data.db
39G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-7352-Data.db
78G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-74076-Data.db
81G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-79663-Data.db
205M /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80370-Data.db
20G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80968-Data.db
20G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-82330-Data.db
20G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-83710-Data.db
4.9G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84015-Data.db
4.9G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84356-Data.db
4.9G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84696-Data.db
333M /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84707-Data.db
92M /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84712-Data.db
92M /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84717-Data.db
99M /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84722-Data.db
2.5G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
1.4T total
after nodetool compact returned:
du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
444G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
910G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
19G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-86229-Data.db
19G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87639-Data.db
5.0G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87923-Data.db
4.8G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88261-Data.db
338M /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88271-Data.db
339M /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88292-Data.db
339M /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88312-Data.db
98M
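One way to compare two such listings is sketched below. It uses only an excerpt of the du output above, and the helper names parse_du and untouched are made up for illustration. It reports files that appear in both listings with the same reported size, which suggests they were not rewritten.

```python
import os


def parse_du(listing):
    """Map SSTable file name -> size string from `du -csh` output lines."""
    sizes = {}
    for line in listing.splitlines():
        parts = line.split()
        # Skip the "total" line and any truncated line that has no path.
        if len(parts) == 2 and parts[1].endswith("-Data.db"):
            sizes[os.path.basename(parts[1])] = parts[0]
    return sizes


def untouched(before, after):
    """Names of files present in both listings with the same reported size."""
    b, a = parse_du(before), parse_du(after)
    return sorted(name for name in b if a.get(name) == b[name])


# Excerpt of the listings in this message, not the full output.
BEFORE = """\
444G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
376G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
2.5G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
1.4T total
"""
AFTER = """\
444G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
910G /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
98M
"""

print(untouched(BEFORE, AFTER))  # ['ATLAS-Data-he-24370-Data.db']
```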
Looking at the disk occupancy for the logical partition where the data folder is in:
df /data_bst
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb1 2927242720 1482502260 1444740460 51% /data_bst
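A rough headroom check on those numbers, as a sketch. It assumes the worst case for size-tiered compaction, where the new SSTable is written in full before the input files are deleted, so free space must be at least the size of the inputs. It also treats everything "Used" on the partition as Cassandra data, which may overstate it. The function names are made up for illustration.

```python
def parse_df_line(line):
    """Split one data row of `df` output (1K-blocks) into named integer fields."""
    filesystem, blocks, used, available, use_pct, mount = line.split()
    return {
        "mount": mount,
        "total_kb": int(blocks),
        "used_kb": int(used),
        "avail_kb": int(available),
    }


def worst_case_fits(input_kb, avail_kb):
    """True if a compaction whose output equals its input size would fit on disk."""
    return input_kb <= avail_kb


row = parse_df_line("/dev/sdb1 2927242720 1482502260 1444740460 51% /data_bst")

# Compacting everything on the partition at once: inputs exceed free space.
print(worst_case_fits(row["used_kb"], row["avail_kb"]))  # False

# Compacting everything except the 444G file (du -h reports binary units).
without_largest_kb = row["used_kb"] - 444 * 1024 * 1024
print(worst_case_fits(without_largest_kb, row["avail_kb"]))  # True
```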
and the situation in the cluster
nodetool -h $HOSTNAME ring (before major compaction)
Address DC Rack Status State Load Effective-Ownership Token
113427455640312821154458202477256070484
10.146.44.17 datacenter1 rack1 Up Normal 1.37 TB 66.67% 0
10.146.44.18 datacenter1 rack1 Up Normal 1.04 TB 66.67% 56713727820156410577229101238628035242
10.146.44.32 datacenter1 rack1 Up Normal 1.14 TB 66.67% 113427455640312821154458202477256070484
nodetool -h $HOSTNAME ring (after major compaction) (Note we were inserting data in the meantime)
Address DC Rack Status State Load Effective-Ownership Token
113427455640312821154458202477256070484
10.146.44.17 datacenter1 rack1 Up Normal 1.38 TB 66.67% 0
10.146.44.18 datacenter1 rack1 Up Normal 1.08 TB 66.67% 56713727820156410577229101238628035242
10.146.44.32 datacenter1 rack1 Up Normal 1.19 TB 66.67% 113427455640312821154458202477256070484
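For reference, the tokens and the 66.67% ownership in the ring output can be reproduced with the sketch below. It assumes RandomPartitioner (token range 0 to 2^127) and a single data center where each row is stored on RF consecutive nodes; the function names are made up for illustration.

```python
from fractions import Fraction


def balanced_tokens(node_count, ring_size=2**127):
    """Evenly spaced initial tokens for node_count nodes."""
    step = ring_size // node_count
    return [i * step for i in range(node_count)]


def effective_ownership(replication_factor, node_count):
    """Fraction of all rows each node holds when tokens are evenly spaced."""
    return Fraction(min(replication_factor, node_count), node_count)


for token in balanced_tokens(3):
    print(token)
print(f"{float(effective_ownership(2, 3)):.2%}")  # 66.67%
```

The tokens match the ring above, so the ring itself is balanced; the differences in Load between nodes come from something other than token assignment.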