cassandra-commits mailing list archives

From "Aaron Morton (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-2191) Multithread across compaction buckets
Date Wed, 23 Mar 2011 11:52:06 GMT


Aaron Morton commented on CASSANDRA-2191:

I have a bunch of questions mostly because I'm trying to understand the reasons for doing
# If max is 0 SSTableTracker.markCompacting() will return an empty list rather than null.

# CompactionManager.submitMinorIfNeeded() sorts the SSTables in the bucket to compact the
older ones first. When the list is passed to SSTableTracker.markCompacting() the order is
# In CompactionManager.submitIndexBuild() and submitSSTableBuild(), should the calls to the executor
be in an inner try block to ensure the lock is always released?
# If the size of the thread pool for CompactionManager.CompactionExecutor() is not configurable,
is there a risk of using too many threads and saturating the IO with compaction? Could some
people want fewer than 1 thread per core?
# For my understanding: What about the CompactionExecutor using the JMXEnabledThreadPoolExecutor
so its stats come back in TP Stats?
# There is a comment in CompactionManager.doCompaction() about relying on a single thread
in compaction when determining if it's a major compaction.
# The order in which the buckets are processed appears to be undefined. Would it make sense
to order them by number of files or avg size so there is a more predictable outcome with multiple
threads possibly working through a similar set of files? 
# For my understanding: Have you considered adding a flag so that a minor compaction will
stop processing buckets if additional threads have started? I think this may make the compaction
less aggressive, as it would more quickly fall back to a single thread until more were needed.
# The order of the list returned from CompactionExecutor.getCompactions() is undefined. Could
they be returned in the order they were added to the executor, to make the data returned
from CompactionExecutor.getColumnFamilyInProgress() more reliable?
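
To illustrate the question about the inner try block, here is a minimal sketch of the pattern. It assumes a lock is taken before the task is handed to the executor. GuardedSubmitter and its methods are hypothetical names for illustration, not code from CompactionManager or the attached patches.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for illustration only; not Cassandra's CompactionManager.
class GuardedSubmitter {
    private final ReentrantLock lock = new ReentrantLock();
    private final ExecutorService executor;

    GuardedSubmitter(ExecutorService executor) {
        this.executor = executor;
    }

    <T> Future<T> submit(Callable<T> task) {
        lock.lock();
        try {
            // submit() can throw (e.g. RejectedExecutionException after shutdown),
            // so it sits inside the try block.
            return executor.submit(task);
        } finally {
            // Runs on both the normal and the exceptional path.
            lock.unlock();
        }
    }

    // Exposed only so the sketch can be checked.
    boolean isLocked() {
        return lock.isLocked();
    }
}
```

With this shape, a rejected submission still leaves the lock free for the next caller.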

> Multithread across compaction buckets
> -------------------------------------
>                 Key: CASSANDRA-2191
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Priority: Critical
>              Labels: compaction
>             Fix For: 0.8
>         Attachments: 0001-Add-a-compacting-set-to-sstabletracker.txt, 0002-Use-the-compacting-set-of-sstables-to-schedule-multith.txt,
> This ticket overlaps with CASSANDRA-1876 to a degree, but the approaches and reasoning
are different enough to open a separate issue.
> The problem with compactions currently is that they compact the set of sstables that
existed the moment the compaction started. This means that for longer running compactions
(even when running as fast as possible on the hardware), a very large number of new sstables
might be created in the meantime. We have observed this proliferation of sstables killing
performance during major/high-bucketed compactions.
> One approach would be to pause compactions in upper buckets (containing larger files)
when compactions in lower buckets become possible. While this would likely solve the problem
with read performance, it does not actually help us perform compaction any faster, which is
a reasonable requirement for other situations.
> Instead, we need to be able to perform any compactions that are currently required in
parallel, independent of what bucket they might be in.
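
As a rough illustration of how parallel compactions could avoid working on the same files, here is a minimal sketch of a "compacting set" that threads claim sstables from. The class, the min/max parameters, and the use of String names for sstables are all assumptions made for this example; it is not the implementation in the attached patches.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; names and behaviour are assumptions, not the attached patch.
class CompactingSetSketch {
    private final Set<String> compacting = new HashSet<>();

    // Atomically claim up to max candidates that no other thread is compacting.
    // Returns an empty list (and claims nothing) if fewer than min are available.
    synchronized List<String> markCompacting(Collection<String> candidates, int min, int max) {
        List<String> claimed = new ArrayList<>();
        for (String sstable : candidates) {
            if (claimed.size() >= max) {
                break;
            }
            if (!compacting.contains(sstable)) {
                claimed.add(sstable);
            }
        }
        if (claimed.size() < min) {
            return Collections.emptyList();
        }
        compacting.addAll(claimed);
        return claimed;
    }

    // Release sstables once their compaction has finished or failed.
    synchronized void unmarkCompacting(Collection<String> sstables) {
        compacting.removeAll(sstables);
    }
}
```

Each compaction thread would call markCompacting() on its bucket, compact only what it was handed back, and call unmarkCompacting() in a finally block, so two threads never hold the same sstable at once.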

