incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: nodetool cleanup - results in more disk use?
Date Mon, 04 Apr 2011 12:58:56 GMT
mmm, interesting. My theory was....

t0 - major compaction runs, there is now one sstable 
t1 - x new sstables have been created
t2 - minor compaction runs and determines there are two buckets, one with the x new sstables
and one with the single big file. The bucket of many files is compacted into one, the bucket
of one file is ignored. 
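To make that concrete, here is a rough sketch of the bucketing idea as I understand it. This is illustrative only, not the actual Cassandra code; the 0.5x-1.5x "similar size" band and the minimum threshold of 4 are my assumptions about the defaults:

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of size-tiered bucketing, not the real implementation.
public class BucketingSketch {
    static final int MIN_THRESHOLD = 4;  // assumed default minimum compaction threshold

    // Group SSTable sizes into buckets; a file joins a bucket when its size is
    // within roughly 50%-150% of that bucket's running average size.
    static List<List<Long>> bucket(List<Long> sstableSizes) {
        List<List<Long>> buckets = new ArrayList<>();
        List<Long> averages = new ArrayList<>();
        for (long size : sstableSizes) {
            boolean placed = false;
            for (int i = 0; i < buckets.size(); i++) {
                long avg = averages.get(i);
                if (size > avg / 2 && size < avg * 3 / 2) {
                    buckets.get(i).add(size);
                    averages.set(i, (avg + size) / 2);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<Long> fresh = new ArrayList<>();   // start a new bucket for this size class
                fresh.add(size);
                buckets.add(fresh);
                averages.add(size);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        // One huge file from a major compaction plus four newly flushed SSTables.
        List<Long> sizes = List.of(3_000_000_000L, 60_000_000L, 55_000_000L, 70_000_000L, 65_000_000L);
        for (List<Long> b : bucket(sizes)) {
            String action = b.size() >= MIN_THRESHOLD ? "minor compaction" : "ignored";
            System.out.println("bucket of " + b.size() + " file(s) -> " + action);
        }
    }
}

With one huge post-major-compaction file and four flush-sized files, the big file sits alone in its bucket and is ignored, while the four small files are compacted together, which is the t2 step above.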

I can see that it takes longer for the big file to be involved in a compaction again, and that when it finally is, that compaction would take more time. But minor compactions of the new SSTables should still happen at the same rate, especially if they are created at the same rate as before.

Am I missing something, or am I just reading the docs wrong?

Cheers
Aaron


On 4 Apr 2011, at 22:20, Jonathan Colby wrote:

> hi Aaron -
> 
> The DataStax documentation brought to light the fact that over time, major compactions will be performed on bigger and bigger SSTables. They actually recommend against performing too many major compactions, which is why I am wary of triggering them ...
> 
> http://www.datastax.com/docs/0.7/operations/scheduled_tasks
> Performing Major Compaction
> 
> A major compaction process merges all SSTables for all column families in a keyspace – not just similar sized ones, as in minor compaction. Note that this may create extremely large SSTables that result in long intervals before the next minor compaction (and a resulting increase in CPU usage for each minor compaction).
> 
> Though a major compaction ultimately frees disk space used by accumulated SSTables, during runtime it can temporarily double disk space usage. It is best to run major compactions, if at all, at times of low demand on the cluster.
> 
> 
> 
> 
> 
> 
> 
> On Apr 4, 2011, at 1:57 PM, aaron morton wrote:
> 
>> cleanup reads each SSTable on disk and writes a new file that contains the same data with the exception of rows that are no longer in a token range the node is a replica for. It's not compacting the files into fewer files or purging tombstones. But it is re-writing all the data for the CF.
>> 
>> Part of the process will trigger GC if needed to free up disk space from SSTables no longer needed.
>> 
>> AFAIK having fewer, bigger files will not cause longer minor compactions. Compaction thresholds are applied per bucket of files that share a similar size; there are normally more small files and fewer large files.
>> 
>> Aaron
>> 
>> On 2 Apr 2011, at 01:45, Jonathan Colby wrote:
>> 
>>> I discovered that a garbage collection cleans up the unused old SSTables. But I still wonder whether cleanup really does a full compaction. This would be undesirable if so.
>>> 
>>> 
>>> On Apr 1, 2011, at 4:08 PM, Jonathan Colby wrote:
>>> 
>>>> I ran nodetool cleanup on a node in my cluster and discovered the disk usage went from 3.3 GB to 5.4 GB. Why is this?
>>>> 
>>>> I thought cleanup just removed hinted handoff information. I read that *during* cleanup extra disk space will be used, similar to a compaction. But I was expecting the disk usage to go back down when it finished.
>>>> 
>>>> I hope cleanup doesn't trigger a major compaction. I'd rather not run major compactions because it means future minor compactions will take longer and use more CPU and disk.
>>>> 
>>>> 
>>> 
>> 
> 

