cassandra-user mailing list archives

From Dongfeng Lu <>
Subject How to remove huge files with all expired data sooner?
Date Fri, 25 Sep 2015 18:40:42 GMT
Hi, I have a table where the TTL is set to only 7 days for all records, and we keep writing
new records every day. In general, I would expect all data files for that table to have
timestamps less than, say, 8 or 9 days old, giving the system some time to work its magic.
However, I occasionally see files more than 9 days old. Last Friday, I saw 4 large files, each
about 10 GB in size, with timestamps about 5, 4, 3, and 2 weeks old. Interestingly, they were
all gone this Monday, leaving 1 new file 9 GB in size.

The compaction strategy is SizeTieredCompactionStrategy (STCS), and I think I understand why
this happened. It seems we write about 10 GB of data every week, and as STCS builds its size
tiers, the file size for the next tier happens to be 10 GB, so a week of data ends up packed
into one huge file. Then the next cycle starts. Another week goes by, and another 10 GB file
is created. This continues until the minimum number of similarly sized files is reached,
which I think is 4 by default. Only then does it start to compact this set of four 10 GB
files. By that time, all data in these 4 files has expired, so we end up with nothing, or
with a much smaller file if some records still have TTL left.
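
To make the cycle above concrete, here is a rough sketch in Python of how I picture it. It is
a toy model, not how Cassandra actually schedules compactions: it assumes exactly one large
file appears per week, that compaction fires as soon as min_threshold such files exist, and
that the compaction output counts as the first file of the next cycle. It ignores
gc_grace_seconds and all smaller tiers. The function name is made up for illustration.

```python
def simulate_weekly_tier(weeks, min_threshold=4):
    """Toy model of the weekly large-file cycle described above.

    Returns one (week, file_ages_in_weeks, compacted) tuple per simulated week.
    Assumption: one large file is produced each week, and the tier is
    compacted as soon as it holds min_threshold files.
    """
    files = []       # week in which each large file was created
    snapshots = []
    for week in range(1, weeks + 1):
        files.append(week)                    # this week's large file
        compacted = len(files) >= min_threshold
        if compacted:
            # All older files are merged away; only the new output remains.
            files = [week]
        ages = [week - created for created in files]
        snapshots.append((week, ages, compacted))
    return snapshots


if __name__ == "__main__":
    for week, ages, compacted in simulate_weekly_tier(8):
        note = " (compaction ran)" if compacted else ""
        print(f"week {week}: file ages in weeks = {ages}{note}")
```

With a 7-day TTL, any file in this model that is 2 or more weeks old holds only expired
data, yet it stays on disk until the tier fills up.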

I have many tables like this, and I'd like to reclaim that space sooner. What would be the
best way to do it? Should I run "nodetool compact" when I see two large files that are 2 weeks
old? Are there configuration parameters I can tune to achieve the same effect? I looked through
all the CQL Compaction Subproperties for STCS, but I am not sure how they can help here. Any
suggestions are welcome.
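
For spotting the old files in the first place, something like the following Python sketch
could be used. It only lists candidates and does not delete or compact anything. It assumes
the data files of a table end in "-Data.db" and uses the file modification time as the
timestamp; the directory path, the function name, and the 2-day slack are placeholders I
chose for illustration.

```python
import time
from pathlib import Path


def find_stale_data_files(table_dir, ttl_days=7, slack_days=2, now=None):
    """List data files whose modification time is older than TTL plus slack.

    Returns (file_name, size_in_bytes, age_in_days) tuples, oldest first.
    """
    now = time.time() if now is None else now
    cutoff = now - (ttl_days + slack_days) * 86400  # seconds per day
    stale = []
    for path in Path(table_dir).glob("*-Data.db"):
        stat = path.stat()
        if stat.st_mtime < cutoff:
            age_days = (now - stat.st_mtime) / 86400
            stale.append((path.name, stat.st_size, age_days))
    # Oldest files first, since those are the most likely to be fully expired.
    return sorted(stale, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical path; adjust to the actual keyspace/table data directory.
    table_dir = "/var/lib/cassandra/data/my_keyspace/my_table"
    for name, size, age_days in find_stale_data_files(table_dir):
        print(f"{name}: {size / 1e9:.1f} GB, {age_days:.0f} days old")
```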

BTW, I am using Cassandra 2.0.6.
