cassandra-commits mailing list archives

From "Dominic Williams (Issue Comment Edited) (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (CASSANDRA-3620) Proposal for distributed deletes - fully automatic "Reaper Model" rather than GCSeconds and manual repairs
Date Thu, 15 Dec 2011 14:20:31 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169761#comment-13169761
] 

Dominic Williams edited comment on CASSANDRA-3620 at 12/15/11 2:19 PM:
-----------------------------------------------------------------------

Ok, I got it, and +1 on that idea. I had actually assumed tombstones were compacted away after
repair anyway. So abandon GCSeconds and simply kill off tombstones created before repair when
it runs successfully (presumably on a range-by-range basis?):
* Improved performance through reduced tombstone load
* No risk of data corruption if repair not run

That would be a cool first step and improve the current situation. 
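
To make this concrete, here's a minimal sketch of the purge check I'm imagining (the names,
e.g. RepairHistory and lastSuccessfulRepairAt, are hypothetical illustrations rather than
existing Cassandra APIs): during compaction a tombstone becomes droppable as soon as a repair
covering its range has completed after the tombstone was written, instead of waiting out
GCSeconds.

{code:java}
// Hypothetical sketch, not actual Cassandra code: decide whether a tombstone
// can be dropped during compaction based on repair history rather than GCSeconds.
public final class RepairBasedPurger
{
    /** Hypothetical lookup: when did the last successful repair of this range complete? */
    public interface RepairHistory
    {
        long lastSuccessfulRepairAt(String keyspace, String columnFamily, long token); // millis, 0 if never
    }

    private final RepairHistory history;

    public RepairBasedPurger(RepairHistory history)
    {
        this.history = history;
    }

    /** True if the tombstone may be purged in this compaction. */
    public boolean isPurgeable(String keyspace, String columnFamily,
                               long token, long tombstoneCreatedAtMillis)
    {
        long repairedAt = history.lastSuccessfulRepairAt(keyspace, columnFamily, token);
        // Only tombstones written *before* the last successful repair of their range
        // are known to have reached every replica, so only those are safe to drop.
        return tombstoneCreatedAtMillis < repairedAt;
    }
}
{code}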

I think a reaper system is still needed, although this feature would take some of the existing
pressure off. There would still be the issue of tombstone build-up between repairs, which means
performance can vary (or rather, degrade) between invocations, plus the load spikes from repair
itself and the manual nature of the process.

I guess I'm on the sharp end of this - we have several column families where columns represent
game objects or messages owned by users where there is a high delete and insert load. Various
operations need to perform slices of user rows and these can get much slower as tombstones
build up, so GCSeconds has been brought right down, but this leads to the constant pain of
"omg how long left before need to run repair or increase GCSeconds" etc.. improving repair
as described would remove the Sword of Damocles threat of data corruption but we'd still need
to make sure it was run regularly, performance would degrade between invocations and repair
would create load spikes. The reaping model can take away those problems. 
                
      was (Author: dccwilliams):
    Ok, I got it, and +1 on that idea. Abandon GCSeconds and simply kill off tombstones created
before repair when it runs successfully (presumably on a range-by-range basis)
* Improved performance through reduced tombstone load
* No risk of data corruption if repair not run

That would be a very cool first step to optimize this.

I think a reaper system would still be well worthwhile, although this feature would take some
pressure off. There is still the issue of tombstone build-up between repairs, which means
performance can vary (or rather, degrade) between invocations, plus there are still the load
spikes from repair itself.

I guess I'm on the sharp end of this - we have several column families where columns represent
game objects or messages owned by users where there is a high delete and insert load. Various
operations need to perform slices of user rows and these can get much slower as tombstones
build up, so GCSeconds has been brought right down, but this leads to the constant pain of
"omg how long left before need to run repair or increase GCSeconds" etc.. improving repair
would remove the Sword of Damocles thing but we'd still need to run it regularly and performance
wouldn't be as consistent it could be with constant background reaping
                  
> Proposal for distributed deletes - fully automatic "Reaper Model" rather than GCSeconds
and manual repairs
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3620
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Dominic Williams
>              Labels: GCSeconds, deletes, distributed_deletes, merkle_trees, repair
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Proposal for an improved system for handling distributed deletes, which removes the requirement
to regularly run repair processes to maintain performance and data integrity. 
> h2. The Problem
> There are various issues with repair:
> * Repair is expensive to run
> * Repair jobs are often made more expensive than they should be by other issues (nodes
dropping requests, hinted handoff not working, downtime etc)
> * Repair processes can often fail and need restarting, for example in cloud environments
where network issues make a node disappear from the ring for a brief moment
> * When you fail to run repair within GCSeconds, either by error or because of issues
with Cassandra, data written to a node that did not see a later delete can reappear (and a
node might miss a delete for several reasons including being down or simply dropping requests
during load shedding)
> * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing,
in some cases the growing tombstone overhead can significantly degrade performance
> Because of the foregoing, in high throughput environments it can be very difficult to
make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one
by one, making sure they succeed and keeping an eye on overall load to reduce system impact.
This isn't desirable, and problems are exacerbated when there are lots of column families
in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone
load (because there are many write/deletes to that column family). The database owner must
run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing
delete operations. 
> It would be much better if there was no ongoing requirement to run repair to ensure deletes
aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility
used in special cases, or to ensure ONE reads get consistent data. 
> h2. "Reaper Model" Proposal
> # Tombstones do not expire, and there is no GCSeconds
> # Tombstones have associated ACK lists, which record the replicas that have acknowledged
them
> # Tombstones are deleted (or marked for compaction) when they have been acknowledged
by all replicas
> # When a tombstone is deleted, it is added to a "relic" index. The relic index makes
it possible for a reaper to acknowledge a tombstone after it is deleted
> # The ACK lists and relic index are held in memory for speed
> # Background "reaper" threads constantly stream ACK requests to other nodes, and stream
ACK responses back for the requests they have received (throttling their usage of CPU and
bandwidth so as not to affect performance)
> # If a reaper receives a request to ACK a tombstone that does not exist, it creates the
tombstone and adds an ACK for the requestor, and replies with an ACK. This is the worst that
can happen, and does not cause data corruption. 
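> A minimal sketch of the bookkeeping this implies, assuming per-tombstone ACK sets and an
> in-memory relic set (the class and method names here are hypothetical illustrations, not
> existing Cassandra code): a tombstone is purged once every replica has acknowledged it, and
> its id is remembered as a relic so a late ACK request can still be answered without
> recreating the tombstone.
> {code:java}
> import java.net.InetAddress;
> import java.util.Map;
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Hypothetical sketch of per-tombstone ACK tracking plus the "relic" index.
> public final class TombstoneAckTracker
> {
>     // tombstone id -> replicas that have acknowledged it (held in memory for speed)
>     private final Map<String, Set<InetAddress>> acks = new ConcurrentHashMap<>();
>     // ids of tombstones already purged, so late ACK requests can still be answered
>     private final Set<String> relics = ConcurrentHashMap.<String>newKeySet();
>
>     /** Records an ACK; returns true if the tombstone is now fully acknowledged and can be purged. */
>     public boolean ack(String tombstoneId, InetAddress replica, Set<InetAddress> allReplicas)
>     {
>         if (relics.contains(tombstoneId))
>             return false; // already purged; we can still reply with an ACK, but nothing left to track
>         Set<InetAddress> seen = acks.computeIfAbsent(tombstoneId, id -> ConcurrentHashMap.<InetAddress>newKeySet());
>         seen.add(replica);
>         if (seen.containsAll(allReplicas))
>         {
>             acks.remove(tombstoneId);
>             relics.add(tombstoneId); // mark for compaction and remember as a relic
>             return true;
>         }
>         return false;
>     }
>
>     /** True if an ACK request can be answered even though the tombstone itself is gone. */
>     public boolean isRelic(String tombstoneId)
>     {
>         return relics.contains(tombstoneId);
>     }
> }
> {code}
> The background reaper threads would then iterate the unacknowledged tombstones, stream throttled
> ACK requests to the replicas missing from each set, and feed the responses into something like
> ack() above; a request for a tombstone that is neither live nor in the relic set would recreate
> the tombstone, as in point 7.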
> ADDENDUM
> The proposal to hold the ACK and relic lists in memory was added after the first posting.
Please see comments for full reasons. Furthermore, a proposal for enhancements to repair was
posted to comments, which would cause tombstones to be scavenged when repair completes (the
author had assumed this was the case anyway, but it seems at time of writing they are only
scavenged during compaction on GCSeconds timeout). The proposals are not exclusive and this
proposal is extended to include the possible enhancements to repair described.
> NOTES
> * If a node goes down for a prolonged period, the worst that can happen is that some
tombstones are recreated across the cluster when it restarts, which does not corrupt data
(and this will only occur with a very small number of tombstones)
> * The system is simple to implement and predictable 
> * With the reaper model, repair would become an optional process for optimizing the database
to increase the consistency seen by ConsistencyLevel.ONE reads, and for fixing up nodes, for
example after an sstable was lost
> h3. Planned Benefits
> * Reaper threads can utilize "spare" cycles to constantly scavenge tombstones in the
background thereby greatly reducing tombstone load, improving query performance, reducing
the system resources needed by processes such as compaction, and making performance generally
more predictable 
> * The reaper model means that GCSeconds is no longer necessary, which removes the threat
of data corruption if repair can't be run successfully within that period (for example if
repair can't be run because of a new adopter's lack of Cassandra expertise, a cron script
failing, or Cassandra bugs or other technical issues)
> * Reaper threads are fully automatic, work in the background and perform fine-grained
operations where interruption has little effect. This is much better for database administrators
than having to manually run and manage repair, whether for the purposes of preventing data
corruption or for optimizing performance, which in addition to wasting operator time also
often creates load spikes and has to be restarted after failure.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
