cassandra-commits mailing list archives

From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-7203) Flush (and Compact) High Traffic Partitions Separately
Date Sat, 29 Nov 2014 11:22:12 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228729#comment-14228729 ]

Benedict commented on CASSANDRA-7203:
-------------------------------------

[~jbellis]: Are we sure that's a good policy? It's generally accepted that a lot of work (esp.
that involving people, e.g. Netflix, Apple) follows a Zipfian/extreme distribution. If we
can prevent the most voluminous customers from degrading performance for everybody, that's surely
a pretty big win? I'm not suggesting this be attacked immediately, but in the medium-to-long
term it seems like a pretty decent yield, and it could be applied on both read and write. If
you have 1% of your data appearing in ~100% of sstables, but the other 99% appearing in only
~1% of your sstables, you're compacting an order of magnitude more often than you might otherwise
need to.

Perhaps [~jasobrown] and [~kohlisankalp] have an idea of how realistic this scenario is?

> Flush (and Compact) High Traffic Partitions Separately
> ------------------------------------------------------
>
>                 Key: CASSANDRA-7203
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7203
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>              Labels: compaction, performance
>
> An idea possibly worth exploring is the use of streaming count-min sketches to collect
data over the up-time of a server to estimate the velocity of different partitions, so that
high-volume partitions can be flushed separately on the assumption that they will be much
smaller in number, thus reducing write amplification by permitting compaction independently
of any low-velocity data.
> Whilst the idea is reasonably straightforward, it seems that the biggest problem here
will be defining any success metric. Obviously any workload following an exponential/zipf/extreme
distribution is likely to benefit from such an approach, but whether or not that would translate
in real terms is another matter.
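
For readers unfamiliar with the data structure named in the description, here is a
minimal sketch of a count-min sketch used to pick out high-volume partitions. It is
illustrative only and is not Cassandra code: the class name, the hot_partitions helper,
the 1% cutoff, the sketch dimensions and the synthetic workload are all assumptions
made for the example. It counts writes per partition key; converting those counts into
a velocity (for instance by decaying or windowing them over time) is not shown.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counters; estimates never fall below the true count."""

    def __init__(self, depth=4, width=1024):
        # depth rows of width counters; both sizes are illustrative choices.
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One hash per row, obtained by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"), digest_size=8, salt=row.to_bytes(2, "big")
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, key, amount=1):
        for row, col in self._buckets(key):
            self.rows[row][col] += amount

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.rows[row][col] for row, col in self._buckets(key))


def hot_partitions(sketch, keys, total_writes, fraction=0.01):
    """Hypothetical helper: keys whose estimated share of writes exceeds fraction."""
    cutoff = total_writes * fraction
    return {k for k in keys if sketch.estimate(k) > cutoff}


def demo():
    # Synthetic skewed workload: one partition receives 90% of all writes.
    sketch = CountMinSketch()
    writes = ["hot-partition"] * 9000 + [f"cold-{i}" for i in range(1000)]
    for key in writes:
        sketch.add(key)
    return sketch, hot_partitions(sketch, set(writes), len(writes))


if __name__ == "__main__":
    sketch, hot = demo()
    print(sketch.estimate("hot-partition"), sorted(hot))
```

The appeal for this use case is that memory stays constant (depth x width counters)
no matter how many partitions are seen, at the cost of occasionally overestimating a
cold partition that happens to share buckets with a hot one in every row.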



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
