incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jun Rao <>
Subject Re: handling deletes
Date Wed, 01 Apr 2009 17:41:08 GMT

I am wondering if this is part of the bigger issue on data consistency.

Following your example: a row x is replicated to node A, B, and C. C goes
down. A and B delete x. When C comes back, C should contact other nodes
that hold hinted handoff data intended for C. So, in theory, the missing
deletion of x will be propagated to C at some point and not lost. However,
the problem is that those hinted handoff nodes can die before the handoff
completes. Then C need some other way to sync itself up. Node A and B are
the only possible sources. Unfortunately, data in A and B are accumulated
independently from C, and therefore syncing them up is a bit challenging.

In the short run, I am not sure if I really like the solution you suggested
here. However, I don't have a better solution either.

In the long run, maybe we should look into peer-to-peer replication
techniques, instead of relying on hinted handoff. In P2P replication, an
update can be directed to any replica, which will try to push it to its
peers. The push will be almost real time if the peers are up. If a peer is
down, changes for it will be accumulated and re-pushed when it's up again.
Because an update is always initiated from one replica, it's easier to sync
up the replicas through log shipping. The benefit of this approach is that
(1) easier reasoning about data synchronization/consistency (still have to
be careful about deletes though); (2) potentially less overhead since we
don't have to do read repairs all the time (anybody know how much overhead
it introduces?); (3) almost the same availability on writes as Cassandra

IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

             Jonathan Ellis                                                
             m>                                                         To 
             03/30/2009 03:19                                           cc 
                                       handling deletes                    
             Please respond to                                             

Avinash pointed out two bugs in my remove code.  One is easy to fix,
the other is tougher.

The easy one is that my code removes tombstones (deletion markers) at
the ColumnFamilyStore level, so when CassandraServer does read repair
it will not know about the tombstones and they will not be replicated
correctly.  This can be fixed by simply moving the removeDeleted call
up to just before CassandraServer's final return-to-client.

The hard one is that tombstones are problematic on GC (that is, major
compaction of SSTables, to use the Bigtable paper terminology).

One failure scenario: Node A, B, and C replicate some data.  C goes
down.  The data is deleted.  A and B delete it and later GC it.  C
comes back up.  C now has the only copy of the data so on read repair
the stale data will be sent to A and B.

A solution: pick a number N such that we are confident that no node
will be down (and catch up on hinted handoffs) for longer than N days.
 (Default value: 10?)  Then, no node may GC tombstones before N days
have elapsed.  Also, after N days, tombstones will no longer be read
repaired.  (This prevents a node which has not yet GC'd from sending a
new tombstone copy to a node that has already GC'd.)

Implementation detail: we'll need to add a 32-bit "time of tombstone"
to ColumnFamily and SuperColumn.  (For Column we can stick it in the
byte[] value, since we already have an unambiguous way to know if the
Column is in a deleted state.)  We only need 32 bits since the time
frame here is sufficiently granular that we don't need ms.  Also, we
will use the system clock for these values, not the client timestamp,
since we don't know what the source of the client timestamps is.

Admittedly this is suboptimal compared to being able to GC immediately
but it has the virtue of being (a) easily implemented, (b) with no
extra components such as a coordination protocol, and (c) better than
not GCing tombstones at all (the other easy way to ensure



  • Unnamed multipart/related (inline, None, 0 bytes)
View raw message