incubator-cassandra-user mailing list archives

From Terje Marthinussen <tmarthinus...@gmail.com>
Subject Re: compaction strategy
Date Tue, 10 May 2011 16:20:40 GMT
> Everyone may be well aware of that, but I'll still remark that a minor
> compaction will try to merge "as many 20MB sstables as it can", up to the
> max compaction threshold (which is configurable). So if you do accumulate
> some newly created sstables at some point in time, the next minor
> compaction will take all of them, and thus not create a 40MB sstable, then
> 80MB, etc. Sure, there will be more steps than with a major compaction,
> but let's keep in mind we don't merge sstables 2 by 2.
>

Well, you do kind of merge them 2 by 2, since you look for at least 4 at a
time ;)
But yes, 20MB should become at least 80MB. Still, that is quite a few hops to
reach 100GB.
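To make the hop counting concrete, here is a back-of-the-envelope sketch
(plain Java, not Cassandra code; the 20MB flush size and the threshold of 4
are simply the numbers from this thread):

// Rough model only: with a minimum compaction threshold of 4, each
// generation of minor compactions merges at least 4 similarly sized
// sstables, so sizes grow roughly 4x per hop.
public class CompactionHops {
    public static void main(String[] args) {
        double sizeMb = 20;            // assumed memtable flush size
        double targetMb = 100 * 1024;  // the ~100GB discussed above
        int hops = 0;
        while (sizeMb < targetMb) {
            sizeMb *= 4;               // at least 4 sstables per merge
            hops++;
        }
        // Prints 7: 20MB -> 80MB -> 320MB -> 1.25GB -> 5GB -> 20GB -> 80GB -> 320GB,
        // and every hop rewrites the data one more time.
        System.out.println(hops + " minor compaction generations to pass 100GB");
    }
}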

> I'm also not too much in favor of triggering major compactions, because it
> mostly has a nasty effect (it creates one huge sstable). Now maybe we could
> expose the difference factor for which we'll consider sstables to be in the
> same bucket.

The nasty side effect I am scared of is disk space: to keep disk space under
control, I need to get down to a single file.

As an example:
Two days ago, I looked at a system where compaction had gone idle, leaving
something like 24 sstables.
Disk use was 370GB.

After manually triggering a full compaction, I was left with a single sstable
of 164GB.

This means I may need more than 3x the full dataset to survive if certain
nasty events, such as repairs or anticompactions, should occur.
Way more than the recommended 2x.
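For what it is worth, the arithmetic behind that 3x figure looks roughly like
this (a simplified accounting that assumes the old sstables and the new output
have to coexist on disk until the compaction or anticompaction finishes):

// Illustrative headroom arithmetic only; the real peak depends on how much
// of the input set a given compaction or repair touches at once.
public class DiskHeadroom {
    public static void main(String[] args) {
        double compactedGb = 164;  // the single sstable after a major compaction
        double steadyGb    = 370;  // the ~24 sstables the node had accumulated
        // While the merge runs, inputs and the new output coexist on disk.
        double peakGb = steadyGb + compactedGb;
        System.out.printf("peak ~%.0fGB = %.1fx the compacted data set%n",
                          peakGb, peakGb / compactedGb);  // ~534GB, ~3.3x
    }
}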

In the same system, I see nodes reaching up towards 900GB during compaction
and 500-600GB otherwise.
This is with OPP (the order-preserving partitioner), so the distribution is
not 100% even, but I expect these 500-600GB nodes to compact down to the
<200GB area if a full compaction is triggered.

That is way, way beyond the recommendation to have 2x the disk space.

You may disagree, but I think this is a problem.
Either we need to recommend 3-5x the best-case disk usage, or we need to fix
Cassandra.

A simple initial improvement may be to change the bucketing strategy when no
suitable candidates can be found.
I believe Lucene, for instance, has a strategy where it can mix a set of small
index fragments with one large one.
Something like that may be worth considering as a fallback strategy, letting
Cassandra compact down to a single file whenever it can; a rough sketch
follows below.
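For illustration, the fallback could look something like this (a sketch in
Java, not Cassandra's actual bucketing code; the 2x bucketing rule and all
names are made up):

import java.util.*;

// Sketch of the fallback idea: if normal size-tiered bucketing finds no
// bucket of minThreshold similarly sized sstables, merge the small
// stragglers together with the single largest sstable (Lucene-style),
// so the node can still converge towards one file.
public class FallbackBucketing {
    /** sizes = on-disk sizes of the live sstables, in MB */
    static List<Long> pickCandidates(List<Long> sizes, int minThreshold) {
        List<Long> sorted = new ArrayList<>(sizes);
        Collections.sort(sorted);
        // Normal path: minThreshold sstables within roughly 2x of each other.
        for (int i = 0; i + minThreshold <= sorted.size(); i++) {
            if (sorted.get(i + minThreshold - 1) <= 2 * sorted.get(i))
                return sorted.subList(i, i + minThreshold);
        }
        // Fallback path: fold the smallest sstables into the largest one.
        List<Long> fallback = new ArrayList<>(
                sorted.subList(0, Math.min(minThreshold - 1, sorted.size() - 1)));
        fallback.add(sorted.get(sorted.size() - 1));
        return fallback;
    }

    public static void main(String[] args) {
        // Three small flushes plus one ~170GB sstable: no same-size bucket
        // exists, so the stragglers get merged into the big file.
        System.out.println(pickCandidates(Arrays.asList(20L, 25L, 80L, 170_000L), 4));
    }
}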

Ultimately, I think segmenting on token space is the only way to fix this.
That segmentation could be done by building histograms of the token
distribution as you compact, and compaction could then adjust the segments as
full compactions take place (see the sketch below).

This would seem simpler to do than a full vnode-based infrastructure.
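
A minimal sketch of the histogram idea (bucket granularity, names and numbers
are invented for illustration; real split points would be tokens rather than
bucket indices):

import java.util.*;

// Sketch only: while compacting, tally bytes written per coarse slice of the
// token space, then derive split points so the next compaction can emit one
// sstable per segment of roughly targetBytes instead of one huge file.
public class TokenHistogramSplitter {
    /** bytesPerBucket[i] = bytes observed for the i-th slice of the token space */
    static List<Integer> splitPoints(long[] bytesPerBucket, long targetBytes) {
        List<Integer> splits = new ArrayList<>();
        long acc = 0;
        for (int bucket = 0; bucket < bytesPerBucket.length; bucket++) {
            acc += bytesPerBucket[bucket];
            if (acc >= targetBytes) {  // close the current segment here
                splits.add(bucket);
                acc = 0;
            }
        }
        return splits;                 // bucket indices where segments end
    }

    public static void main(String[] args) {
        // A skewed distribution, e.g. what an OPP cluster might see.
        long[] hist = {10, 40, 5, 120, 60, 30, 90, 15};
        System.out.println(splitPoints(hist, 100));  // -> [3, 6]
    }
}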

Terje
