cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "DOAN DuyHai (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-13442) Support a means of strongly consistent highly available replication with tunable storage requirements
Date Sat, 30 Sep 2017 09:17:04 GMT


DOAN DuyHai commented on CASSANDRA-13442:

So by flagging 2 replicas as fully repaired and used for consistent reads and one as transient,
we're introducing indeed *asymmetry* into the replicas themselves.  

For the following reasoning suppose RF=3,  2 _consistent_ replicas and 1 _transient_ replica

*I  Consistency*

Now reasoning with CL from the client POV is no longer straightforward, the client should
know which replica(s) are fully repaired and which are not.

Furthermore, now we need to keep meta data about repaired and unrepaired range of data.

- suppose a {{SELECT * FROM table WHERE partition = xxx AND clustering >= yyyy AND clustering
<= zzz}}. What if some ranges of clustering have been repaired and some other aren't ?
Do we query QUORUM on 3 replicas or does the coordinator issue 2 queries, one for the repaired
range on 1 consistent replica and 2 on a QUORUM of 3 replicas for the unrepaired range ?

- how do we reason with CL like global multi-DC QUORUM and EACH_QUORUM for reads ?

*II Failure handling*

Failure of replica is now *asymmetric* too for reads. A failed _transient_ replica is a no
brainer. A failed _consistent_ replica(s) is a bigger issue. With classic replication CL=ONE
guarantee you to be highly available in the face of 2 failed replicas. Now having 2 failed
_consistent_ replicas, since the repaired data has been deleted on the _transient_ replica,
the CL=ONE contract is broken ...

*III Operations*

- how to manage bootstrap of new node(s) ? Is the new node(s) considered transient for all
ranges until at least one round of repair has finished ? Obviously bootstrapping new node(s)
by streaming data from _consistent_ replicas will make more sense for repaired data but in
this case the bootstrap process itself need to be repair-aware

- how to manage decommission ? How do we dispatch repaired and unrepaired range of data so
that the remaining nodes in the cluster are balanced (with regard to the _consistent_ and
_transient_ replica state) ?

- what's about MV and secondary index ? I know that MV is likely to be flagged as experimental
but unless we remove it completely from the base code we should also think about those concepts
of _transient_ replicas for MV at least.

- what's about backup & restore ? Now any backup process should also be repair-aware and
also know the location of the _consistent_ replicas

- for *existing* clusters, how to transition from normal replication to this new replication
scheme ? And vice versa 

> Support a means of strongly consistent highly available replication with tunable storage
> -----------------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-13442
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction, Coordination, Distributed Metadata, Local Write-Read
>            Reporter: Ariel Weisberg
> Replication factors like RF=2 can't provide strong consistency and availability because
if a single node is lost it's impossible to reach a quorum of replicas. Stepping up to RF=3
will allow you to lose a node and still achieve quorum for reads and writes, but requires
committing additional storage.
> The requirement of a quorum for writes/reads doesn't seem to be something that can be
relaxed without additional constraints on queries, but it seems like it should be possible
to relax the requirement that 3 full copies of the entire data set are kept. What is actually
required is a covering data set for the range and we should be able to achieve a covering
data set and high availability without having three full copies. 
> After a repair we know that some subset of the data set is fully replicated. At that
point we don't have to read from a quorum of nodes for the repaired data. It is sufficient
to read from a single node for the repaired data and a quorum of nodes for the unrepaired
> One way to exploit this would be to have N replicas, say the last N replicas (where N
varies with RF) in the preference list, delete all repaired data after a repair completes.
Subsequent quorum reads will be able to retrieve the repaired data from any of the two full
replicas and the unrepaired data from a quorum read of any replica including the "transient"
> Configuration for something like this in NTS might be something similar to { DC1="3-1",
DC2="3-2" } where the first value is the replication factor used for consistency and the second
values is the number of transient replicas. If you specify { DC1=3, DC2=3 } then the number
of transient replicas defaults to 0 and you get the same behavior you have today.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message