cassandra-commits mailing list archives

From "Peter Schuller (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-2494) Quorum reads are not consistent
Date Sun, 17 Apr 2011 21:02:05 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020867#comment-13020867
] 

Peter Schuller commented on CASSANDRA-2494:
-------------------------------------------

As far as I can tell, the consistency being asked for was never promised by Cassandra and is
in fact not expected.

The expected behavior of writes is that they propagate; the difference between ONE and QUORUM
is just how many replicas must receive a write before a success status is returned to the
client. For reads, that means you may get lucky at ONE or you may get lucky at QUORUM;
the positive guarantee is in the case of a *completing* QUORUM write followed by a QUORUM
read.
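
To make the overlap argument concrete, here is a minimal Python sketch (not Cassandra code; the function names are made up for illustration). It brute-forces whether every possible set of W replicas that acknowledged a write shares at least one replica with every possible set of R replicas consulted by a read. It only models which replicas hold a completed write; it ignores timestamps, hinted handoff and repair.

```python
from itertools import combinations


def quorum(replication_factor):
    """Smallest majority of the replicas for one key."""
    return replication_factor // 2 + 1


def always_overlap(replication_factor, write_acks, read_replicas):
    """True if every set of write acknowledgers intersects every read set."""
    replicas = range(replication_factor)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, write_acks)
        for read in combinations(replicas, read_replicas)
    )


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    # Completed QUORUM write followed by QUORUM read: the sets always intersect.
    print("QUORUM write, QUORUM read:", always_overlap(rf, q, q))
    # Write acknowledged by only one replica: a QUORUM read may miss it.
    print("ONE write, QUORUM read:", always_overlap(rf, 1, q))
```

With a replication factor of 3, the first check holds and the second does not, which is the "may get lucky" case described above.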

So just to be clear, although I don't think this is what is being asked for: As far as I know,
it has never been the case, nor the intent to promise, that a write which fails is guaranteed
not to eventually complete. Simply "fixing" reads is not enough; by design the data will be
replicated during read-repair and AES - this is how consistency is achieved in Cassandra.

However, it sounds like what is being asked for is not that they don't propagate in the event
of a write "failure", but just that reads don't see the writes until they are sufficiently
propagated to guarantee that any future QUORUM read will also see the data. I can understand
that is desirable, in the sense of achieving monotonically forward-moving data as the benchmark/test
from the e-mail thread does. Another way to look at it is that maybe you never want to read data
successfully prior to achieving a certain level of replication, in order to avoid a client
ever seeing data that may suddenly go away due to e.g. a node failure in spite of said failure
not exceeding the number of failures the cluster was designed to survive.

So the key point would be the bit about guaranteeing that any "future QUORUM read will see
the data or data subsequently overwritten", and actively read-repairing and waiting for it
to happen would take care of that. The important part is ensuring that a quorum of nodes
has seen the data; one should not wait for a quorum to agree on the *current* version of
the data, as that would create potentially unbounded round-trips on hotly contended data.
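
As a rough illustration of the idea, the following Python toy model (hypothetical classes and functions, not Cassandra's read path) has three replicas X, Y, Z, with a write that reached only X. A QUORUM read picks the newest version among the replicas it contacted. With blocking repair enabled, it pushes that one version to the contacted replicas once before returning, without re-reading to check whether a newer version has appeared in the meantime.

```python
class Replica:
    """Toy replica holding a single value with a write timestamp."""

    def __init__(self):
        self.value = 0
        self.timestamp = 0

    def apply(self, value, timestamp):
        # Last-write-wins: only a newer timestamp replaces the stored value.
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp


def quorum_read(replicas, contacted, blocking_repair):
    """Return the newest value among the contacted replicas.

    If blocking_repair is set, push that version to every contacted replica
    before returning, so a quorum is known to hold it (or something newer).
    """
    newest = max((replicas[i] for i in contacted), key=lambda r: r.timestamp)
    value, timestamp = newest.value, newest.timestamp
    if blocking_repair:
        for i in contacted:
            replicas[i].apply(value, timestamp)  # single push, no re-read
    return value


def run(blocking_repair):
    replicas = [Replica() for _ in range(3)]  # X, Y, Z
    replicas[0].apply(1, 1)  # the write of N=1 reached X only
    first = quorum_read(replicas, [0, 1], blocking_repair)   # reads X and Y
    second = quorum_read(replicas, [1, 2], blocking_repair)  # reads Y and Z
    return first, second


if __name__ == "__main__":
    print("without blocking repair:", run(False))
    print("with blocking repair:   ", run(True))
```

Without the blocking step, the second read can return an older value than the first; with it, the second read sees the value because Y was repaired before the first read returned.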

One thing to consider: there are cases where read-repair is currently not done, like
range slices, and an implementation that requires read repair for consistency would have to
account for them.



> Quorum reads are not consistent
> -------------------------------
>
>                 Key: CASSANDRA-2494
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2494
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sean Bridges
>
> As discussed in this thread,
> http://www.mail-archive.com/user@cassandra.apache.org/msg12421.html
> Quorum reads should be consistent. Assume we have a cluster of 3 nodes (X,Y,Z) and a
> replication factor of 3. If a write of N is committed to X, but not Y and Z, then a read from
> X should not return N unless the read is committed to at least two nodes. To ensure this,
> a read from X should wait for an ack of the read repair write from either Y or Z before returning.
> Are there system tests for cassandra? If so, there should be a test similar to the original
> post in the email thread. One thread should write 1,2,3... at consistency level ONE. Another
> thread should read at consistency level QUORUM from a random host, and verify that each read
> is >= the last read.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
