incubator-cassandra-user mailing list archives

From Henrik Schröder <>
Subject Re: Question regarding major compaction.
Date Tue, 01 May 2012 11:31:14 GMT
But what's the difference between doing an extra read from that One Big
File and doing an extra read from whatever SSTable happens to be largest
in the course of automatic minor compaction?

We have a pretty update-heavy application, and doing a major compaction can
remove up to 30% of the used disk space. That directly translates into fewer
reads and fewer SSTables that rows appear in. Everything that's unchanged
since the last major compaction is obviously faster to access, and
everything that's changed since the last major compaction is about the same
as if we hadn't done it?
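
To spell out what I mean by rows appearing in fewer SSTables, here is a toy
model in Python. It is only a sketch: the dict layout and the names
merge_row and major_compact are made up for illustration and are not
Cassandra APIs, and it ignores tombstones and bloom filter false positives.
The only behaviour it relies on is that the column version with the highest
timestamp wins.

```python
def merge_row(key, sstables):
    """Merge one row across SSTables; the highest timestamp wins per column.

    Each SSTable is modelled (hypothetically) as
    {row_key: {column: (timestamp, value)}}.
    Returns the merged row and how many SSTables had to be read.
    """
    merged, touched = {}, 0
    for sstable in sstables:
        row = sstable.get(key)
        if row is None:
            continue  # stands in for a bloom filter miss: nothing to read
        touched += 1
        for column, (ts, value) in row.items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    return merged, touched


def major_compact(sstables):
    """Collapse every SSTable into one, keeping only the winning versions."""
    keys = {key for sstable in sstables for key in sstable}
    return [{key: merge_row(key, sstables)[0] for key in keys}]


# An old SSTable and a newer one that overwrites a single column.
old = {"user1": {"name": (1, "Ann"), "city": (1, "Oslo")}}
new = {"user1": {"city": (2, "Lund")}}

row_before, reads_before = merge_row("user1", [old, new])
row_after, reads_after = merge_row("user1", major_compact([old, new]))
```

Before compaction the read touches both SSTables and throws away the stale
"city" value; afterwards it touches one and returns the same row.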

So I'm still confused. I don't see a significant difference between doing
the occasional major compaction or leaving it to do automatic minor
compactions. What am I missing? Reads will "continually degrade" with
automatic minor compactions as well, won't they?

I can sort of see that if you have a moving active data set, then that will
most probably only exist in the smallest SSTables and frequently be the
object of minor compactions, and doing a major compaction will move all of
it into the biggest SSTables?
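
For the threshold point (the big file only being picked up once there are
enough SSTables of similar size), a rough simulation may help. This is a
sketch under stated assumptions, not the real compaction code: SSTables are
just sizes, "similar" means within 0.5x to 1.5x of a bucket's average, the
threshold is 4, and a merge produces the sum of its inputs (so nothing is
purged, which is pessimistic). The function names are hypothetical.

```python
def bucket_by_size(sizes, low=0.5, high=1.5):
    """Group SSTable sizes into buckets of roughly similar size."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if low * avg <= size <= high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])  # no similar bucket found, start a new one
    return buckets


def compact(sizes, min_threshold=4):
    """Merge any bucket holding at least min_threshold SSTables, repeatedly."""
    sizes = list(sizes)
    while True:
        for bucket in bucket_by_size(sizes):
            if len(bucket) >= min_threshold:
                for s in bucket:
                    sizes.remove(s)
                sizes.append(sum(bucket))  # assumption: no data is purged
                break
        else:
            return sorted(sizes)  # no bucket is full, nothing left to do


def flushes_until_big_file_compacts(big, flush_size, min_threshold=4,
                                    max_flushes=10_000):
    """Count memtable flushes before the big SSTable joins a compaction.

    Returns None if that never happens within max_flushes.
    """
    sstables = [big]
    for flushes in range(1, max_flushes + 1):
        sstables = compact(sstables + [flush_size], min_threshold)
        if big not in sstables:
            return flushes
    return None
```

With a big file of size 640 and flushes of size 10, the minor compactions
build 40, 160 and then 640 sized files, and the big file is only merged once
three more 640s exist, which takes 192 flushes in this model. Until then any
overwritten data in it stays on disk.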


On Mon, Apr 30, 2012 at 05:35, aaron morton <> wrote:

> Depends on your definition of significantly. There are a few things to
> consider.
> * Reading from SSTables for a request is a serial operation. Reading from
> 2 SSTables will take twice as long as 1.
> * If the data in the One Big File™ has been overwritten, reading it is a
> waste of time. And it will continue to be read until the row is
> compacted away.
> * You will need to get min_compaction_threshold (CF setting) SSTables that
> big before automatic compaction will pick up the big file.
> On the other side: Some people do report getting value from nightly major
> compactions. They also manage their cluster to reduce the impact of
> performing the compactions.
> Hope that helps.
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> On 26/04/2012, at 9:37 PM, Fredrik wrote:
>  Exactly, but why would reads be significantly slower over time when
> including just one more, although sometimes large, SSTable in the read?
> Ji Cheng wrote 2012-04-26 11:11:
> I'm also quite interested in this question. Here's my understanding of
> this problem.
>  1. If your workload is append-only, doing a major compaction shouldn't
> affect the read performance too much, because each row appears in one
> sstable anyway.
>  2. If your workload is mostly updating existing rows, then more and more
> columns will be obsoleted in that big sstable created by major compaction.
> And that super big sstable won't be compacted until you either have another
> 3 similar-sized sstables or start another major compaction. But I am not
> very sure whether this will be a major problem, because you only end up
> reading one more sstable. Using size-tiered compaction against a
> mostly-update workload may by itself result in reading multiple sstables
> for a single row key.
>  Please correct me if I am wrong.
>  Cheng
>  On Thu, Apr 26, 2012 at 3:50 PM, Fredrik <> wrote:
>> In the tuning documentation regarding Cassandra, it's recommended not to
>> run major compactions.
>> I understand what a major compaction is all about but I'd like an
>> in-depth explanation as to why reads "will continually degrade until the
>> next major compaction is manually invoked".
>> From the doc:
>> "So while read performance will be good immediately following a major
>> compaction, it will continually degrade until the next major compaction is
>> manually invoked. For this reason, major compaction is NOT recommended by
>> DataStax."
>> Regards
>> /Fredrik
