cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ariel Weisberg (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-13442) Support a means of strongly consistent highly available replication with tunable storage requirements
Date Mon, 16 Oct 2017 17:55:00 GMT


Ariel Weisberg commented on CASSANDRA-13442:

bq. It's still not clear to me what you mean by 10-20x, are you saying you could need 1-100x
more hardware if you don't use a transitive replica?
I mean that if you are running RF=3 in a data center you could switch to RF=5 and the increase
in required capacity would be as if you were running at RF=3.1 or 3.2. The ranges each node
transiently replicates would contain 10x-20x less data than a fully replicated range (even
during node failures) and would require 10x-20x less CPU, network IO, disk IO, and CPU.

Put another way. If you want to increase RF by 2 then you need to commit 100% of the hardware
necessary to replicate your data twice and process reads and writes on that data. With transient
replication you don't need to commit 100% of the hardware you need to commit 5-10% additional

There is another optimization I talked about at NGCC that I am calling cheap quorums. Transient
replication alone only addresses data footprint it doesn't do as much to eliminate the per
write overhead of increasing RF. With cheap quorums you only write to a transient replica
if you are unable to write to a full replica. So transient replicas don't even see most writes,
and reads that are routed to transient replicas turn into bloom filter checks that usually
fail instead of IO. If you have a working dynamic snitch and something like rapid read protection
for writes you can handle node failures efficiently on the write path.

If 16 vnodes were workable that would be enough in most cases to maintain availability and
performance when there are node failures. Without that then transient replication (and cheap
quorums) only reduce footprint in non-failure conditions which isn't anywhere near as useful.
You need enough vnodes to ensure that a transient replica doesn't get hit with the full load
of another node on failure.

> Support a means of strongly consistent highly available replication with tunable storage
> -----------------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-13442
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction, Coordination, Distributed Metadata, Local Write-Read
>            Reporter: Ariel Weisberg
> Replication factors like RF=2 can't provide strong consistency and availability because
if a single node is lost it's impossible to reach a quorum of replicas. Stepping up to RF=3
will allow you to lose a node and still achieve quorum for reads and writes, but requires
committing additional storage.
> The requirement of a quorum for writes/reads doesn't seem to be something that can be
relaxed without additional constraints on queries, but it seems like it should be possible
to relax the requirement that 3 full copies of the entire data set are kept. What is actually
required is a covering data set for the range and we should be able to achieve a covering
data set and high availability without having three full copies. 
> After a repair we know that some subset of the data set is fully replicated. At that
point we don't have to read from a quorum of nodes for the repaired data. It is sufficient
to read from a single node for the repaired data and a quorum of nodes for the unrepaired
> One way to exploit this would be to have N replicas, say the last N replicas (where N
varies with RF) in the preference list, delete all repaired data after a repair completes.
Subsequent quorum reads will be able to retrieve the repaired data from any of the two full
replicas and the unrepaired data from a quorum read of any replica including the "transient"
> Configuration for something like this in NTS might be something similar to { DC1="3-1",
DC2="3-2" } where the first value is the replication factor used for consistency and the second
values is the number of transient replicas. If you specify { DC1=3, DC2=3 } then the number
of transient replicas defaults to 0 and you get the same behavior you have today.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message