incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: What does ReadRepair exactly do?
Date Thu, 25 Oct 2012 08:29:41 GMT
It's important to point out the difference between Read Repair, in the context of the read_repair_chance
setting, and Consistent Reads in the context of the CL setting.

If RR is active on a request it means the request is sent to ALL UP nodes for the key, and
the RR process is ASYNC to the request. If all of the nodes involved in the request return
to the coordinator before rpc_timeout, ReadCallback.maybeResolveForRepair() will put a repair
task into the READ_REPAIR stage. This will compare the values and, IF there is a DigestMismatch,
it will start a Row Repair read that reads the data from all nodes and MAY result in differences
being detected and fixed.
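
A rough, self-contained sketch of that background path (only ReadCallback.maybeResolveForRepair() and the READ_REPAIR stage are real Cassandra names; everything else below is made up to illustrate the flow, not taken from the source):

    // Background read repair, simplified: the digest comparison is handed off to a
    // separate stage and runs after, and independently of, the client's read.
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class BackgroundReadRepairSketch {
        // Stand-in for the READ_REPAIR stage.
        static final ExecutorService READ_REPAIR = Executors.newSingleThreadExecutor();

        static void maybeResolveForRepair(int responses, int contacted, boolean digestsMatch) {
            // Only fires if every replica the read was sent to answered before rpc_timeout.
            if (responses < contacted) return;
            READ_REPAIR.submit(() -> {
                if (!digestsMatch) {
                    // DigestMismatch: start a Row Repair read against all replicas so
                    // differences can be detected and fixed, outside the client request.
                    System.out.println("row repair read started");
                }
            });
        }

        public static void main(String[] args) {
            maybeResolveForRepair(3, 3, false);   // all replicas replied, digests differ
            READ_REPAIR.shutdown();
        }
    }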

All of this is outside of the processing of your read request. It is separate from the stuff
below.

Inside the user read request, when ReadCallback.get() is called and CL nodes have responded,
the responses are compared. If a DigestMismatch happens, a Row Repair read is started and
the result of this read is returned to the user. This Row Repair read MAY detect differences;
if it does, it resolves the super set, sends the delta to the replicas, and returns the super
set value to the client.
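
As a toy model of that blocking path (my own simplification: one timestamped value per replica stands in for a whole row; the node names and values mirror the example further down, none of this is Cassandra source):

    // Consistent Read resolution, simplified: on a DigestMismatch the data is read
    // from the CL replicas, the most recent version wins ("super set"), stale
    // replicas get the delta, and the winner is returned to the client.
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.Map;

    public class RowRepairSketch {
        record Versioned(String value, long timestamp) {}

        public static void main(String[] args) {
            Map<String, Versioned> replicas = new HashMap<>();
            replicas.put("node1", new Versioned("val1", 1));
            replicas.put("node2", new Versioned("val2", 2));   // holds the newest write
            replicas.put("node3", new Versioned("val1", 1));

            // Digests differ (val1 vs val2), so read the full data and resolve the
            // super set -- here simply the version with the highest timestamp.
            Versioned winner = replicas.values().stream()
                    .max(Comparator.comparingLong(Versioned::timestamp))
                    .orElseThrow();

            // Send the delta to every replica that is behind, then return the
            // resolved value to the client.
            for (Map.Entry<String, Versioned> e : replicas.entrySet()) {
                if (e.getValue().timestamp() < winner.timestamp()) {
                    System.out.println("repair " + e.getKey() + " -> " + winner.value());
                    e.setValue(winner);
                }
            }
            System.out.println("returned to client: " + winner.value());
        }
    }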

> I'm still missing, how read repairs behave. Just extending your example for
> the following case: 
The example does not use Read Repair; it is handled by Consistent Reads. 

The purpose of RR is to reduce the probability that a read in the future using any of the
replicas will result in a Digest Mismatch. "Any of the replicas" means ones that were not
necessary for this specific read request. 
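
Whether a given read takes the RR path at all is a per-request roll against the read_repair_chance setting. Roughly (illustrative only, not the actual coordinator code; the 0.1 value is just an example):

    // read_repair_chance decides, per read, whether the coordinator also sends the
    // request to the remaining UP replicas so they can be checked and repaired in
    // the background. A setting of 0.1 means roughly 1 read in 10 does this.
    import java.util.concurrent.ThreadLocalRandom;

    public class ReadRepairChanceSketch {
        public static void main(String[] args) {
            double readRepairChance = 0.1;   // example per-column-family setting
            boolean sendToAllUpReplicas =
                    ThreadLocalRandom.current().nextDouble() < readRepairChance;
            System.out.println("query all UP replicas on this read: " + sendToAllUpReplicas);
        }
    }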

> 2. You do a write operation (W1) with quorum of val=2
> node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)
If the write has not completed, then it is not yet a successful write at the specified CL, as
it could still fail.

Therefore the R + W > N Strong Consistency guarantee does not apply at this exact point in
time. A read to the cluster at this exact point in time using QUORUM may return val2 or val1.
Again, the operation W1 has not completed; if read R' starts and completes while W1 is processing,
it may or may not return the result of W1.
 
> In this case, for read R1, the value val2 does not have a quorum. Would read
> R1 return val2 or val4 ? 

If val4 is in the memtable on the node before the second read, the result will be val4.
Writes that happen between the initial read and the second read (after a Digest Mismatch) are
included in the read result.

The way I think about consistency is "what value do reads see if writes stop" (see the
arithmetic sketch after this list):

* If you have R + W > N, so all writes succeeded at CL QUORUM, all successful reads are
guaranteed to see the last write.
* If you are using a low CL and/or had failed writes at QUORUM, then R + W < N. All successful
reads will *eventually* see the last value written, and they are guaranteed to return the
value of a previous write or no value. Eventually background Read Repair, Hinted Handoff,
or nodetool repair will repair the inconsistency.
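
The arithmetic behind that rule of thumb, as a standalone check (example numbers only, not Cassandra code):

    // With N replicas, QUORUM = floor(N/2) + 1. If R + W > N, every read quorum
    // overlaps every write quorum, so a successful QUORUM read must include at
    // least one replica that holds the last successful QUORUM write.
    public class QuorumMath {
        public static void main(String[] args) {
            int n = 3;                  // replication factor
            int quorum = n / 2 + 1;     // 2 when RF = 3
            int r = quorum, w = quorum;
            System.out.println("QUORUM = " + quorum);
            System.out.println("R + W > N ? " + (r + w > n));   // 2 + 2 > 3 -> true
        }
    }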

Hope that helps. 


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 4:39 AM, "Hiller, Dean" <Dean.Hiller@nrel.gov> wrote:

>> Thanks Zhang. But, this again seems a little strange thing to do, since
>> one
>> (say R2) of the 2 close replicas (say R1,R2) might be down, resulting in a
>> read failure while there are still enough number of replicas (R1 and R3)
>> live to satisfy a read.
> 
> 
> He means in the case where all 3 nodes are live… if a node is down,
> naturally it redirects to the other node and still succeeds because it
> found 2 nodes even with one node down (feel free to test this live though
> !!!!!)
> 
>> 
>> Thanks for the example Dean. This definitely clears things up when you
>> have
>> an overlap between the read and the write, and one comes after the other.
>> I'm still missing, how read repairs behave. Just extending your example
>> for
>> the following case:
>> 
>> 1. node1 = val1 node2 = val1 node3 = val1
>> 
>> 2. You do a write operation (W1) with quorum of val=2
>> node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)
>> 
>> 3. Now with a read (R1) from node1 and node2, a read repair will be
>> initiated that needs to write val2 on node 1.
>> node1 = val1; node2 = val2; node3 = val1  (read repair val2 is not
>> complete
>> yet)
>> 
>> 4. Say, in the meanwhile node 1 receives a write val 4; Read repair for R1
>> now arrives at node 1 but sees a newer value val4.
>> node1 = val4; node2 = val2; node3 = val1  (write val4 is not complete,
>> read
>> repair val2 not complete)
>> 
>> In this case, for read R1, the value val2 does not have a quorum. Would
>> read
>> R1 return val2 or val4 ?
> 
>> 
> At this point as Manu suggests, you need to look at the code but most
> likely what happens is they lock that row, receive the write in memory (i.e.
> not losing it) and return to the client, caching it so as soon as read-repair
> is over, it will write that next value.  I.e. your client would receive
> val2 and val4 would be the value in the database right after you received
> val2.  I.e. when a client interacts with cassandra and you have tons of
> writes to a row, val1, val2, val3, val4 in a short time period, just like
> a normal database, your client may get one of those 4 values depending on
> where the read gets inserted in the order of the writes… same as a normal
> RDBMS.  The only thing you don't have is the atomic nature with other rows.
> 
> NOTICE: they would not have to cache val4 very long, and if a newer write
> came in, they would just replace it with that newer val and cache that one
> instead so it would not be a queue… but this is all just a guess… read the
> code if you really want to know.
> 
>> 
>> 
>> Zhang, Manu wrote
>>> And we don't send read requests to all of the three replicas (R1, R2, R3)
>>> if CL=QUORUM; just 2 of them depending on proximity
>> 
>> 
>> 
>> 
>> 
> 

