cassandra-commits mailing list archives

From "Dominic Williams (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use "Reaper Model" rather than GCSeconds and scheduled repairs
Date Tue, 13 Dec 2011 11:59:31 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Dominic Williams updated CASSANDRA-3620:
----------------------------------------

    Description: 
Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:

* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping
requests, hinted handoff not working, downtime etc)
* Repair itself can often fail and need restarting, especially in cloud environments where
a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, whether through operator oversight or because
of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading
your system
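To make the resurrection failure mode concrete, here is a minimal sketch (Python, with hypothetical names; not Cassandra's actual storage code). Replica A holds the tombstone, replica B missed the delete; once GCSeconds elapses, A purges the tombstone on compaction and B's stale copy wins the merge on read:

```python
GC_GRACE_SECONDS = 10  # illustrative; Cassandra's default is 864000 (10 days)

class Replica:
    def __init__(self):
        self.data = {}        # key -> (value, write_timestamp)
        self.tombstones = {}  # key -> deletion_timestamp

    def write(self, key, value, ts):
        self.data[key] = (value, ts)

    def delete(self, key, ts):
        self.data.pop(key, None)
        self.tombstones[key] = ts

    def compact(self, now):
        # Purge tombstones older than GC_GRACE_SECONDS.
        self.tombstones = {k: ts for k, ts in self.tombstones.items()
                           if now - ts < GC_GRACE_SECONDS}

def merged_read(key, replicas):
    # Newest live value across replicas, unless a newer tombstone suppresses it.
    newest_value, newest_ts = None, -1
    for r in replicas:
        if key in r.data and r.data[key][1] > newest_ts:
            newest_value, newest_ts = r.data[key]
    for r in replicas:
        if key in r.tombstones and r.tombstones[key] >= newest_ts:
            return None  # the delete wins
    return newest_value

a, b = Replica(), Replica()
a.write("k", "v", ts=1); b.write("k", "v", ts=1)
a.delete("k", ts=2)                        # b misses the delete (e.g. it was down)
assert merged_read("k", [a, b]) is None    # tombstone still suppresses b's stale copy

a.compact(now=2 + GC_GRACE_SECONDS + 1)    # repair never ran; tombstone purged
print(merged_read("k", [a, b]))            # -> "v": the deleted value reappears
```

Running repair within the window would have streamed the tombstone to B before the purge, which is exactly the obligation the proposal below aims to remove.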

Because of the foregoing, in high throughput environments you often cannot make repair a cron
job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed
and keeping an eye on overall load so you don't impact your system. This isn't great, and
it is made worse if you have lots of column families or have to run a low GCSeconds on a column
family to reduce tombstone load. You know that if you don't manage to run repair within the
GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your
head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always
receive data successfully written with QUORUM.
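The QUORUM-overlap claim is just arithmetic: with replication factor N, a QUORUM write and a QUORUM read each touch floor(N/2)+1 replicas, so the two sets must intersect in at least one replica that holds the latest write. A minimal check:

```python
def quorum(n):
    # QUORUM = floor(n/2) + 1 replicas for replication factor n
    return n // 2 + 1

for n in (3, 5, 7):
    w = r = quorum(n)
    overlap = w + r - n  # minimum replicas common to the write set and read set
    assert overlap >= 1  # the read always sees at least one up-to-date replica
    print(f"RF={n}: quorum={w}, guaranteed overlap={overlap}")
```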

Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds.
Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads
get consistent data. 

h2. "Reaper Model" Proposal

# Tombstones do not expire, and there is no GCSeconds. 
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged
by all replicas.
# If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply
reset
# Background "reaper" threads constantly stream ACK requests and ACKs to and from other replicas,
and delete tombstones that have received all their ACKs
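The bookkeeping in the steps above can be sketched as follows (Python, hypothetical names; a sketch of the proposal, not an actual Cassandra API): each tombstone carries the set of replicas that must acknowledge it, a re-delete resets the ACK set, and the reaper purges only fully acknowledged tombstones.

```python
class Tombstone:
    def __init__(self, replicas, ts):
        self.replicas = set(replicas)  # replicas that must ACK this tombstone
        self.ts = ts                   # deletion timestamp
        self.acks = set()              # replicas that have ACKed so far

    def ack(self, replica):
        self.acks.add(replica)

    def redelete(self, ts):
        # A new delete of the same cf/key/name resets the ACK list (step 4).
        self.ts = ts
        self.acks.clear()

    def fully_acked(self):
        return self.acks == self.replicas

def reap(tombstones):
    # Background reaper: keep only tombstones not yet ACKed by every replica
    # (steps 3 and 5); no GCSeconds deadline is involved.
    return {k: t for k, t in tombstones.items() if not t.fully_acked()}

replicas = {"n1", "n2", "n3"}
ts = {"row:k": Tombstone(replicas, ts=1)}
ts["row:k"].ack("n1"); ts["row:k"].ack("n2")
assert "row:k" in reap(ts)       # n3 has not ACKed: the tombstone must survive
ts["row:k"].ack("n3")
assert "row:k" not in reap(ts)   # all replicas ACKed: now safe to purge
```

The safety property falls out directly: a tombstone can only disappear after every replica has seen it, so the stale-replica resurrection described under "The Problem" cannot occur.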

A number of systems could be used to maintain synchronization while nodes are added/removed;
these can be discussed in a separate Jira issue.

h3. Advantages

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize "spare" cycles and run constantly to prevent the load spikes and
performance issues associated with repair
* There will no longer be a risk of data loss if repair can't be run for some reason (for
example, a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra
bugs preventing repair from running)
* Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically*



> Proposal for distributed deletes - use "Reaper Model" rather than GCSeconds and scheduled
repairs
> -------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3620
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.0.5
>            Reporter: Dominic Williams
>              Labels: GCSeconds, deletes, distributed_deletes, merkle_trees, repair
>             Fix For: 1.1
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
