From: aaron morton <aaron@thelastpickle.com>
Subject: Re: What does ReadRepair exactly do?
Date: Fri, 26 Oct 2012 22:53:56 +1300
To: user@cassandra.apache.org

>> replicas but to ensure we read at least one newest value as long as write
>> quorum succeeded beforehand and W+R > N.

This is correct.
It's not that a quorum of nodes agree; it's that a quorum of nodes participate. If a quorum participates in both the write and the read, you are guaranteed that at least one node was involved in both. The Wikipedia definition helps here: "A quorum is the minimum number of members of a deliberative assembly necessary to conduct the business of that group" http://en.wikipedia.org/wiki/Quorum
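To put numbers on that, here is a throwaway Python sketch of my own (nothing from the C* code base, the node names are made up): with N replicas, any write set and read set of quorum size must share at least one node whenever R + W > N.

    # toy illustration: a quorum write set and a quorum read set must overlap
    N = 3
    QUORUM = N // 2 + 1                     # 2 when N = 3
    write_set = {"node1", "node2"}          # the W nodes that acked the write
    read_set = {"node2", "node3"}           # the R nodes that answer the read
    assert len(write_set) + len(read_set) > N   # R + W > N
    assert write_set & read_set                 # so at least one node saw both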

It's a two-step process: first, do we have enough people to make a decision? Second, following the rules, what was the decision?

In C* the rule is to use the value with the highest timestamp, not the value with the highest number of "votes". The red boxes on this slide are the winning values: http://www.slideshare.net/aaronmorton/cassandra-does-what-code-mania-2012/67 (I think one of my slides in that deck may have been misleading in the past). In Riak the rule is to use Vector Clocks.

So

> I agree that returning val4 is the right thing to do if quorum (two) nodes
> among (node1,node2,node3) have the val4

Is incorrect.

We return the value with the highest timestamp returned from the nodes involved in the read. Only one needs to have val4.
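As a toy illustration (plain Python, my own names, not Cassandra's actual read path): the coordinator resolves by timestamp, so a single replica holding val4 is enough.

    # hypothetical responses from the replicas involved in the read: (value, timestamp)
    responses = [("val1", 1), ("val1", 1), ("val4", 4)]
    value, ts = max(responses, key=lambda r: r[1])   # highest timestamp wins
    print(value)                                     # val4, held by only one replica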

> The heart of the problem
> here is that the coordinator responds to a client request "assuming" that
> the consistency has been achieved the moment it issues a row repair with the
> super-set of the resolved value; without receiving acknowledgement on the
> success of a repair from the replicas for a given consistency constraint.

and

> My intuition behind saying this is because we
> would respond to the client without the replicas having confirmed their
> meeting the consistency requirement.

It is not necessary for the coordinator to wait.

Consider an example: the app has stopped writing to the cluster, and for a certain column nodes 1, 2 and 3 have value:timestamp bar:2, bar:2 and foo:1 respectively. The last write was a successful CL QUORUM write of bar with timestamp 2. However node 3 did not acknowledge this write for some reason.

To make it interesting, the commit log volume on node 3 is full. Mutations are blocking in the commit log queue, so any write on node 3 will time out and fail, but reads are still working. We could imagine this is why node 3 did not commit bar:2.

Some read examples, RR is not = active:

1) Client reads from node 4 (a non replica) with CL QUORUM, request goes to nodes 1 and 2. Both agree on bar as the value.
2) Client reads from node 3 with CL QUORUM, request is processed locally and on node 2.
    * There is a digest mismatch.
    * A Row Repair read runs against nodes 2 and 3.
    * The super set resolves to bar:2.
    * Node 3 (the coordinator) queues a delta write locally to write bar:2. No other delta writes are sent.
    * Node 3 returns bar:2 to the client.
3) Client reads from node 3 at CL QUORUM. The same thing as (2) happens and bar:2 is returned.
4) Client reads from node 2 at CL QUORUM, read goes to nodes 2 and 3. Roughly the same thing as (2) happens and bar:2 is returned.
5) Client reads from node 1 at CL ONE. Read happens locally only and returns bar:2.
6) Client reads from node 3 at CL ONE. Read happens locally only and returns foo:1.

So:
* A read at CL QUORUM will always return bar:2 even if node 3 only has foo:1 on disk.
* A read at CL ONE will return no value or any previous write.
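A tiny Python model of the example above (my own sketch, not Cassandra internals) gives the same answers: quorum reads always resolve to bar:2, and CL ONE depends on which node you hit.

    # replica state from the example: (value, timestamp) per node
    replicas = {"node1": ("bar", 2), "node2": ("bar", 2), "node3": ("foo", 1)}

    def read(contacted):
        # the participating replicas' values are resolved by highest timestamp
        return max((replicas[n] for n in contacted), key=lambda vt: vt[1])

    print(read(["node1", "node2"]))   # QUORUM via nodes 1+2 -> ('bar', 2)
    print(read(["node2", "node3"]))   # QUORUM via nodes 2+3 -> ('bar', 2)
    print(read(["node1"]))            # CL ONE on node 1     -> ('bar', 2)
    print(read(["node3"]))            # CL ONE on node 3     -> ('foo', 1)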

The delta write from the Row Repair goes to a single node, so R + W > N cannot be applied to it. It can almost be thought of as an internal implementation detail. The delta write from a Digest Mismatch, HH writes, full RR writes and nodetool repair are used to:

* Reduce the chance of a Digest Mismatch when CL > ONE.
* Eventually reach a state where reads at any CL return the last write.

They are not used to ensure strong consistency when R + W > N. You could turn those things off and R + W > N would still work.
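If it helps, this is roughly how I picture the Digest Mismatch path from example (2); again, only a hand-rolled Python sketch under my own assumptions (the names and the md5 stand-in are mine, not the actual read path code):

    import hashlib

    # state from the example: node 3 never committed bar:2
    replicas = {"node2": ("bar", 2), "node3": ("foo", 1)}

    def digest(value_ts):
        # stand-in for the digest a replica returns instead of the full row
        return hashlib.md5(repr(value_ts).encode()).hexdigest()

    data = replicas["node2"]                       # full data from one replica
    if digest(data) != digest(replicas["node3"]):  # digests disagree
        # mismatch: do a full data read, resolve by timestamp, and queue a
        # delta write for the stale replica; the client reply does not wait on it
        data = max(replicas.values(), key=lambda vt: vt[1])
    print(data)                                    # ('bar', 2)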
 
Hope that helps.


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 7:15 AM, shankarpnsn <shankarpnsn@gmail.com> wrote:

> manuzhang wrote
>> read quorum doesn't mean we read newest values from a quorum number of
>> replicas but to ensure we read at least one newest value as long as write
>> quorum succeeded beforehand and W+R > N.
>
> I beg to differ here. Any read/write, by definition of quorum, should have
> at least n/2 + 1 replicas that agree on that read/write value. Responding to
> the user with a newer value, even if the write creating the new value hasn't
> completed cannot guarantee any read consistency > 1.
>
> Hiller, Dean wrote
>>> Kind of an interesting question
>>>
>>> I think you are saying if a client read resolved only the two nodes as
>>> said in Aaron's email back to the client and read-repair was kicked off
>>> because of the inconsistent values and the write did not complete yet and
>>> I guess you would have two nodes go down to lose the value right after
>>> the
>>> read, and before write was finished such that the client read a value
>>> that
>>> was never stored in the database. The odds of two nodes going out are
>>> pretty slim though.
>>> Thanks,
>>> Dean
>
> Bingo! I do understand that the odds of a quorum of nodes going down are low
> and that any subsequent read would achieve a quorum. However, I'm wondering
> what would be the right thing to do here, given that the client has
> particularly asked for a certain consistency on the read and cassandra
> returns a value that doesn't have the consistency. The heart of the problem
> here is that the coordinator responds to a client request "assuming" that
> the consistency has been achieved the moment it issues a row repair with the
> super-set of the resolved value; without receiving acknowledgement on the
> success of a repair from the replicas for a given consistency constraint.
>
> In order to adhere to the given consistency specification, the row repair
> (due to consistent reads) should repeat the read after issuing a
> "consistency repair" to ensure the consistency is met. Like Manu
> mentioned, this could of course lead to a number of repeat reads if the
> writes arrive quickly - until the read gets timed out. However, note that we
> would still be honoring the consistency constraint for that read.
>
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583400.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
