cassandra-commits mailing list archives

From "Tyler Hobbs (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6109) Consider coldness in STCS compaction
Date Fri, 11 Oct 2013 23:16:42 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793127#comment-13793127
] 

Tyler Hobbs commented on CASSANDRA-6109:
----------------------------------------

I've spent some more time thinking about this and it seems like we either need a more sophisticated
approach in order to handle the various corner cases or we need to disable this feature by
default.

If we disable the feature by default, then using a hotness percentile or something similar
might be okay.

If we want to enable the feature by default, I've got a couple of more sophisticated approaches:

The first approach is fairly simple and uses two parameters:
* SSTables which receive less than X% of the reads/sec per key of the hottest sstable (for
the whole CF) will be considered cold.
* If the cold sstables make up more than Y% of the total reads/sec, don't consider the warmest
of the cold sstables cold. (In other words, go through the "cold" bucket and remove the warmest
sstables until the cold bucket makes up less than Y% of the total reads/sec.)

This solves one problem of basing coldness on the mean rate, which is that if you have almost
all cold sstables, the mean will be very low.  Comparing against the max deals well with this.
 The second parameter acts as a hedge for the case you brought up where a large number of
cold sstables can collectively account for a high percentage of the total reads.
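
To make the two-parameter rule concrete, here is a minimal Python sketch. It is not the attached patch or any Cassandra code; the SSTableStats type, its field names, and find_cold_sstables are hypothetical names made up for illustration, and "hotness" is assumed to mean reads/sec per key.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SSTableStats:
    """Hypothetical per-sstable read statistics (illustration only)."""
    name: str
    hotness: float        # assumed: reads/sec per key
    reads_per_sec: float  # total reads/sec served by this sstable


def find_cold_sstables(sstables: List[SSTableStats],
                       x_pct: float, y_pct: float) -> List[SSTableStats]:
    """Return the sstables treated as cold, coldest first."""
    if not sstables:
        return []
    max_hotness = max(s.hotness for s in sstables)
    total_reads = sum(s.reads_per_sec for s in sstables)

    # Rule 1: cold means less than X% of the hotness of the hottest sstable.
    threshold = max_hotness * x_pct / 100.0
    cold = sorted((s for s in sstables if s.hotness < threshold),
                  key=lambda s: s.hotness)

    # Rule 2: while the cold set serves more than Y% of all reads,
    # drop its warmest member (the last element after sorting).
    read_limit = total_reads * y_pct / 100.0
    while cold and sum(s.reads_per_sec for s in cold) > read_limit:
        cold.pop()
    return cold
```

Because the threshold is derived from the hottest sstable rather than the mean, a CF made up almost entirely of cold sstables does not drag the cutoff down with it.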

The second approach is less hacky but more difficult to explain or tune; it's a bucket optimization
measure that covers these concerns.  Ideally, we would optimize two things:
* Average sstable hotness of the bucket
* The percentage of the total CF reads that are included in the bucket

These two items are somewhat in opposition.  Optimizing only for the first measure would mean
just compacting the two hottest sstables.  Optimizing only for the second would mean compacting
all sstables.  We can combine the two measures with different weightings to get a pretty good
bucket optimization measure.  I've played around with some different measures in python and
have a script that makes approximately the same bucket choices I would.  However, as I mentioned,
this would be pretty hard for operators to understand and tune intelligently, somewhat like
phi_convict_threshold.  If you're still open to that, I can attach my script with some example
runs.
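
For readers who want a feel for the trade-off, the following is a rough Python sketch of one possible weighted measure. It is not the script mentioned above: the linear weighting, the normalization by the hottest sstable, the restriction to "top-k hottest" candidate buckets, and all names are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Each sstable is modeled as (hotness, reads_per_sec); both are assumed inputs.
SSTable = Tuple[float, float]


def bucket_score(bucket: List[SSTable], all_sstables: List[SSTable],
                 hotness_weight: float) -> float:
    """Blend average bucket hotness with the bucket's share of CF reads."""
    max_hotness = max(h for h, _ in all_sstables)
    total_reads = sum(r for _, r in all_sstables)
    if max_hotness == 0 or total_reads == 0:
        return 0.0
    # Both terms are scaled into [0, 1] so the weight is meaningful.
    avg_hotness = sum(h for h, _ in bucket) / len(bucket) / max_hotness
    read_share = sum(r for _, r in bucket) / total_reads
    return hotness_weight * avg_hotness + (1.0 - hotness_weight) * read_share


def best_bucket(sstables: List[SSTable], hotness_weight: float,
                min_size: int = 2) -> List[SSTable]:
    """Pick the highest-scoring bucket among the k hottest sstables."""
    ranked = sorted(sstables, key=lambda s: s[0], reverse=True)
    candidates = [ranked[:k] for k in range(min_size, len(ranked) + 1)]
    if not candidates:
        return []
    return max(candidates,
               key=lambda b: bucket_score(b, ranked, hotness_weight))
```

With hotness_weight set to 1.0 this sketch picks just the two hottest sstables, and with 0.0 it picks all of them (assuming every sstable serves some reads), matching the two extremes described above. Intermediate weights are where tuning gets hard to reason about.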

> Consider coldness in STCS compaction
> ------------------------------------
>
>                 Key: CASSANDRA-6109
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6109
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Tyler Hobbs
>             Fix For: 2.0.2
>
>         Attachments: 6109-v1.patch, 6109-v2.patch
>
>
> I see two options:
> # Don't compact cold sstables at all
> # Compact cold sstables only if there is nothing more important to compact
> The latter is better if you have cold data that may become hot again...  but it's confusing
> if you have a workload such that you can't keep up with *all* compaction, but you can keep
> up with hot sstable.  (Compaction backlog stat becomes useless since we fall increasingly
> behind.)



--
This message was sent by Atlassian JIRA
(v6.1#6144)
