cassandra-commits mailing list archives

From "Pedro Gordo (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (CASSANDRA-12201) Burst Hour Compaction Strategy
Date Thu, 08 Jun 2017 08:03:21 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pedro Gordo updated CASSANDRA-12201:
------------------------------------
    Comment: was deleted

(was: Apologies for the lack of progress on this issue, but due to a bigger workload from my job
and other factors, I had to halt work on this ticket for six months. I'm now resuming this
piece of work.

I studied the data structures for Cassandra 2.0, but from what I know, there were significant
changes in 3.0, so I'll need to get up to speed on the new data storage in 3.0. I was considering
implementing this just on 2.2, but from what I see here: http://cassandra.apache.org/doc/latest/development/patches.html
no contributions for previous versions are being accepted, so I'll need to go straight to
3.x.)

> Burst Hour Compaction Strategy
> ------------------------------
>
>                 Key: CASSANDRA-12201
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12201
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Compaction
>            Reporter: Pedro Gordo
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> The motivation for this strategy is to take advantage of periods of the day when
there's less I/O on the cluster. Such a period will be called “Burst Hour” (BH),
and hence the strategy will be named “Burst Hour Compaction Strategy” (BHCS).
> The following process would be fired during BH:
> 1. Read all the SSTables and detect which partition keys are present in more than the
compaction minimum threshold value.
> 2. Gather all the SSTables that have keys present in other SSTables, where each such key
has at least as many copies as the minimum compaction threshold.
> 3. Repeat step 2 until the bucket for gathered SSTables reaches the maximum compaction
threshold (32 by default), or until we've searched all the keys.
> 4. The compaction itself will be done by MaxSSTableSizeWriter. The compacted
tables will have a maximum size equal to the configurable value of max_sstable_size (100MB
by default).
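
Steps 1 to 3 above can be sketched roughly as follows. This is only an illustration of the selection logic as described, not code from the ticket or from Cassandra. The function name select_bucket, the dict-of-sets input, the "widest spread first" key ordering and the min_threshold default of 4 are all assumptions made for the example, and "minimum" is read here as "at least min_threshold SSTables".

```python
from collections import defaultdict


def select_bucket(sstables, min_threshold=4, max_threshold=32):
    """Pick SSTables to compact together (hypothetical helper, not Cassandra API).

    sstables maps an SSTable name to the set of partition keys it contains.
    """
    # Step 1: index which SSTables hold each partition key.
    holders = defaultdict(set)
    for name, keys in sstables.items():
        for key in keys:
            holders[key].add(name)

    bucket = []
    # Steps 2-3: visit keys, most widely spread first (an assumed ordering),
    # and gather their SSTables until the bucket is full or keys run out.
    for key in sorted(holders, key=lambda k: (-len(holders[k]), k)):
        if len(holders[key]) < min_threshold:
            break  # every remaining key is spread over even fewer SSTables
        for name in sorted(holders[key]):
            if name not in bucket:
                bucket.append(name)
                if len(bucket) >= max_threshold:
                    return bucket
    return bucket
```

For instance, with four SSTables where only key k1 appears in three of them, select_bucket(..., min_threshold=3) would gather just those three SSTables and leave the fourth alone.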
> The maximum compaction task (the nodetool compact command) performs exactly the same operation
as the background compaction task, differing only in that it can be triggered outside of the
Burst Hour.
> This strategy tries to address three issues of the existing compaction strategies:
> - Due to max_sstable_size_limit, there's no need to reserve disk space for a huge compaction.
> - The number of SSTables that we need to read from to reply to a read query will be consistently
maintained at a low level and controllable through the referenced_sstable_limit property.
> - It removes the dependency on continuous high I/O.
> Possible future improvements:
> - Continuously evaluate how many pending compactions we have and the I/O status, and then
decide, based on that, whether to start the compaction.
> - If, during the day, the size of all the SSTables in a family set reaches a certain
maximum, then background compaction can occur anyway. This maximum should be elevated due
to the high CPU usage of BHCS.
> - Make it possible to set several compaction time intervals, instead of just one.
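
The last improvement, several compaction time intervals, could be checked along these lines. This is a minimal sketch under stated assumptions: the function name in_burst_hour and the list of (start, end) pairs are hypothetical, windows are treated as half-open (start included, end excluded), and a window whose end is earlier than its start is taken to wrap past midnight.

```python
from datetime import time


def in_burst_hour(now, windows):
    """Return True if `now` (a datetime.time) falls inside any burst window.

    windows is a list of (start, end) datetime.time pairs (hypothetical format).
    """
    for start, end in windows:
        if start <= end:
            # Ordinary window within a single day, e.g. 01:00-04:00.
            if start <= now < end:
                return True
        else:
            # Window that wraps past midnight, e.g. 23:00-00:30.
            if now >= start or now < end:
                return True
    return False
```

A scheduler would call this before starting a background compaction, for example with windows = [(time(1, 0), time(4, 0)), (time(23, 0), time(0, 30))].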



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org