cassandra-commits mailing list archives

From "Pedro Gordo (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-12201) Burst Hour Compaction Strategy
Date Wed, 08 Feb 2017 10:30:42 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854024#comment-15854024
] 

Pedro Gordo edited comment on CASSANDRA-12201 at 2/8/17 10:30 AM:
------------------------------------------------------------------

Apologies for the lack of progress on this issue, but due to a bigger workload at my job and
other factors, I had to halt work on this ticket for six months. I'm now resuming this piece
of work.

I studied the data structures of Cassandra 2.0, but from what I know, there were significant
changes in 3.0, so I'll need to get up to speed on the new data storage in 3.0. I was considering
implementing this just on 2.2, but from what I see here: http://cassandra.apache.org/doc/latest/development/patches.html
no contributions for previous versions are being accepted, so I'll need to go straight to
3.x.


was (Author: pedro_gordo):
Apologies for the lack of progress on this issue but due factors I had to halt work on this
for six months. I'm now resuming this piece of work.

I studied on the data structure for Cassandra 2.0 but from what I know, there were significant
changes to 3.0, so I'll need to get up to speed on the new data storage on 3.0. I was considering
to implement this just on 2.2, but from what I see here: http://cassandra.apache.org/doc/latest/development/patches.html
no contributions for previous versions are being accepted, so I'll need to go straight to
3.x.

> Burst Hour Compaction Strategy
> ------------------------------
>
>                 Key: CASSANDRA-12201
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12201
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Compaction
>            Reporter: Pedro Gordo
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> Although it may be subject to change, for the moment I plan to create a strategy that
will revolve around taking advantage of periods of the day when there's less I/O on the cluster.
This time of the day will be called “Burst Hour” (BH), and hence the strategy will be
named “Burst Hour Compaction Strategy” (BHCS). 
> The following process would be fired during BH:
> 1. Read all the SSTables and detect which partition keys are present in more SSTables than a configurable
value, which I'll call referenced_sstable_limit. This value will be three by default.
> 2. Group all the repeated keys with a reference to the SSTables containing them.
> 3. Calculate the total size of the SSTables which will be merged for the first partition
key on the list created in step 2. If the calculated size is bigger than a property which I'll
call max_sstable_size (also configurable), more than one SSTable will be created in step 4.
> 4. During the merge, the data will be streamed from the SSTables up to the point where we have
a size close to max_sstable_size. After we reach this point, the stream is paused, and the
new SSTable will be closed, becoming immutable. Repeat the streaming process until we've merged
all SSTables for the partition key that we're iterating over.
> 5. Cycle through the rest of the collection created in step 2 and remove any SSTables
which don't exist anymore because they were merged in step 4. An alternative course of action
here would be, instead of removing the SSTable from the collection, to change its reference
to the SSTable(s) created in step 4.
> 6. Repeat steps 3 to 5 until we have traversed the entire collection created
in step 2.
> This strategy addresses three issues of the existing compaction strategies:
> - Due to max_sstable_size, there's no need to reserve disk space for a huge compaction,
as can happen with STCS.
> - The number of SSTables that we need to read from to reply to a read query will be consistently
kept at a low level, controllable through the referenced_sstable_limit property.
This addresses the STCS scenario in which we might have to read from a lot of SSTables.
> - It removes LCS's dependency on continuous high I/O.
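
The process described in the ticket can be sketched roughly as follows. This is a minimal in-memory illustration, not Cassandra code: it assumes each SSTable is modelled as a dict mapping partition keys to data sizes, and the function names (group_hot_keys, merge_bounded, burst_hour_compact) and the "bh-N" output names are hypothetical. Only referenced_sstable_limit and max_sstable_size come from the description. For simplicity the sketch recomputes the key grouping after every merge instead of patching the collection as step 5 suggests, and it adds sizes together rather than reconciling overwrites or tombstones.

```python
from collections import defaultdict


def group_hot_keys(sstables, referenced_sstable_limit=3):
    """Steps 1-2: map every partition key present in more than
    referenced_sstable_limit SSTables to the SSTables containing it."""
    refs = defaultdict(set)
    for name, partitions in sstables.items():
        for key in partitions:
            refs[key].add(name)
    return {k: v for k, v in refs.items() if len(v) > referenced_sstable_limit}


def merge_bounded(sstables, names, max_sstable_size):
    """Steps 3-4: merge the named SSTables, closing the current output
    and starting a new one before it would exceed max_sstable_size."""
    merged = defaultdict(int)
    for name in names:
        for key, size in sstables[name].items():
            merged[key] += size  # simplification: sizes of duplicates add up
    outputs, current, current_size = [], {}, 0
    for key in sorted(merged):
        size = merged[key]
        if current and current_size + size > max_sstable_size:
            outputs.append(current)  # "close" the SSTable; it is now immutable
            current, current_size = {}, 0
        current[key] = size
        current_size += size
    if current:
        outputs.append(current)
    return outputs


def burst_hour_compact(sstables, referenced_sstable_limit=3, max_sstable_size=100):
    """Steps 5-6: keep merging until no partition key is spread over
    more than referenced_sstable_limit SSTables."""
    tables = {name: dict(parts) for name, parts in sstables.items()}
    created = 0
    while True:
        hot = group_hot_keys(tables, referenced_sstable_limit)
        if not hot:
            return tables
        key = min(hot)  # "first partition key on the list"
        names = hot[key]
        outputs = merge_bounded(tables, names, max_sstable_size)
        for name in names:
            del tables[name]  # merged inputs no longer exist
        for out in outputs:
            created += 1
            tables[f"bh-{created}"] = out  # hypothetical naming scheme


sstables = {
    "a": {"k1": 10, "k2": 5},
    "b": {"k1": 10, "k3": 5},
    "c": {"k1": 10},
    "d": {"k1": 10, "k4": 5},
    "e": {"k5": 7},
}
result = burst_hour_compact(sstables, referenced_sstable_limit=3, max_sstable_size=45)
```

In this example only k1 appears in more than three SSTables, so a, b, c and d are merged into two size-bounded outputs while e is left untouched.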



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
