cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dominic Williams (Commented) (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-3620) Proposal for distributed deletes - use "Reaper Model" rather than GCSeconds and scheduled repairs
Date Tue, 13 Dec 2011 23:29:30 GMT


Dominic Williams commented on CASSANDRA-3620:

Make it optional per column family? Repair would still need to exist anyway so could fall
back to that for cases like this.
> Proposal for distributed deletes - use "Reaper Model" rather than GCSeconds and scheduled
> -------------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-3620
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Dominic Williams
>              Labels: GCSeconds,, deletes,, distributed_deletes,, merkle_trees, repair,
>   Original Estimate: 504h
>  Remaining Estimate: 504h
> Here is a proposal for an improved system for handling distributed deletes.
> h2. The Problem
> There are various issues with repair:
> * Repair is expensive anyway
> * Repair jobs are often made more expensive than they should be by other issues (nodes
dropping requests, hinted handoff not working, downtime etc)
> * Repair processes can often fail and need restarting, for example in cloud environments
where network issues make a node disappear 
> from the ring for a brief moment
> * When you fail to run repair within GCSeconds, either by error or because of issues
with Cassandra, data written to a node that did not see a later delete can reappear (and a
node might miss a delete for several reasons including being down or simply dropping requests
during load shedding)
> * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing,
in some cases the growing tombstone overhead can significantly degrade performance
> Because of the foregoing, in high throughput environments it can be very difficult to
make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one
by one, making sure they succeed and keeping and eye on overall load to reduce system impact.
This isn't desirable, and problems are exacerbated when there are lots of column families
in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone
load (because there are many write/deletes to that column family). The database owner must
run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing
delete operations. 
> It would be much better if there was no ongoing requirement to run repair to ensure deletes
aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility
used in special cases, or to ensure ONE reads get consistent data. 
> h2. "Reaper Model" Proposal
> # Tombstones do not expire, and there is no GCSeconds
> # Tombstones have associated ACK lists, which record the replicas that have acknowledged
> # Tombstones are only deleted (or marked for compaction) when they have been acknowledged
by all replicas
> # When a tombstone is deleted, it is added to a fast "relic" index of MD5 hashes of cf-key-name[-subName]-ackList.
The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted
> # Background "reaper" threads constantly stream ACK requests to other nodes, and stream
back ACK responses back to requests they have received (throttling their usage of CPU and
bandwidth so as not to affect performance)
> # If a reaper receives a request to ACK a tombstone that does not exist, it creates the
tombstone and adds an ACK for the requestor, and replies with an ACK 
> * The existence of entries in the relic index do not affect normal query performance
> * If a node goes down, and comes up after a configurable relic entry timeout, the worst
that can happen is that a tombstone that hasn't received all its acknowledgements is re-created
across the replicas when the reaper requests their acknowledgements (which is no big deal
since this does not corrupt data)
> * Since early removal of entries in the relic index does not cause corruption, it can
be kept small, or even kept in memory
> * Simple to implement and predictable 
> h3. Planned Benefits
> * Operations are finely grained (reaper interruption is not an issue)
> * The labour & administration overhead associated with running repair can be removed
> * Reapers can utilize "spare" cycles and run constantly in background to prevent the
load spikes and performance issues associated with repair
> * There will no longer be the threat of corruption if repair can't be run for some reason
(for example because of a new adopter's lack of Cassandra expertise, a cron script failing,
or Cassandra bugs preventing repair being run etc)
> * Deleting tombstones earlier, thereby reducing the number involved in query processing,
will often dramatically improve performance

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:!default.jspa
For more information on JIRA, see:


View raw message