Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of owenzhang1990@gmail.com
 designates 209.85.215.172 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <CCAEBD05.13ABB%Dean.Hiller@nrel.gov>
References: <1351179938911-7583395.post@n2.nabble.com>
	<CCAEBD05.13ABB%Dean.Hiller@nrel.gov>
Date: Fri, 26 Oct 2012 00:37:41 +0800
Message-ID: 
 <CABT57mZtrR6bVBet9X6QvNL+gM3EuU_LuORmfYke8YtEHggDiQ@mail.gmail.com>
Subject: Re: What does ReadRepair exactly do?
From: Manu Zhang <owenzhang1990@gmail.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=047d7b343a3662bfd404cce4d289

--047d7b343a3662bfd404cce4d289
Content-Type: text/plain; charset=ISO-8859-1

A quorum read doesn't mean we read the newest value from a quorum of
replicas. It only ensures that at least one replica we read has the newest
value, as long as a quorum write succeeded beforehand and W + R > N.
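
To make the overlap argument concrete, here is a minimal sketch (not
Cassandra's actual code). It assumes each replica stores a (timestamp, value)
pair and that the read resolves to the highest timestamp; the function names
and the integer timestamps are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of w replicas shares a member with every set of r replicas."""
    replicas = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(replicas, w)
        for rs in combinations(replicas, r)
    )


def resolve(responses):
    """Pick the value with the highest timestamp among (timestamp, value) pairs."""
    return max(responses)[1]


# Hypothetical state: a quorum write of val2 (timestamp 2) was acknowledged
# by node1 and node2; node3 still holds the older val1 (timestamp 1).
cluster = {"node1": (2, "val2"), "node2": (2, "val2"), "node3": (1, "val1")}

print(quorums_always_overlap(3, 2, 2))  # W + R > N
print(quorums_always_overlap(3, 1, 2))  # W + R == N
for pair in combinations(cluster, 2):
    # Every possible quorum read includes at least one replica with val2.
    print(pair, resolve([cluster[node] for node in pair]))
```

With N = 3 and W = R = 2, any two replicas we read must include one that
acknowledged the write, so the resolved value is the newest one even though
node3 is stale.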

On Fri, Oct 26, 2012 at 12:00 AM, Hiller, Dean <Dean.Hiller@nrel.gov> wrote:

> Kind of an interesting question
>
> I think you are saying: suppose a client read is resolved from only two
> nodes, as described in Aaron's email, and the result is returned to the
> client while read repair is kicked off because of the inconsistent values.
> The write has not completed yet. I guess two nodes would then have to go
> down right after the read, and before the write finished, to lose the
> value, such that the client read a value that was never stored in the
> database.  The odds of two nodes going out are pretty slim though.
>
> Or, what if the node holding part of the write went down? As long as the
> client stays up, it would complete its write on the other two nodes.
> It seems to me that as long as two nodes don't fail, you are reading at
> quorum and this fits the consistency model, since you get a value that
> will be on two nodes in the immediate future.
>
> Thanks,
> Dean
>
> On 10/25/12 9:45 AM, "shankarpnsn" <shankarpnsn@gmail.com> wrote:
>
> >aaron morton wrote
> >>> 2. You do a write operation (W1) with quorum of val=2
> >>> node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)
> >> If the write has not completed then it is not a successful write at the
> >> specified CL as it could fail now.
> >>
> >> Therefore the R + W > N Strong Consistency guarantee does not apply at
> >> this exact point in time. A read to the cluster at this exact point in
> >> time using QUORUM may return val2 or val1. Again, the operation W1 has
> >> not completed; if read R' starts and completes while W1 is processing it
> >> may or may not return the result of W1.
> >
> >I agree completely that it is fair to have this indeterminism in case of
> >partial/failed/in-flight writes, based on what nodes respond to a
> >subsequent read.
> >
> >
> >aaron morton wrote
> >> It's important to point out the difference between Read Repair, in the
> >> context of the read_repair_chance setting, and Consistent Reads in the
> >> context of the CL setting. All of this is outside of the processing of
> >> your read request. It is separate from the stuff below.
> >>
> >> Inside the user read request, when ReadCallback.get() is called and CL
> >> nodes have responded, the responses are compared. If a DigestMismatch
> >> happens then a Row Repair read is started, and the result of this read is
> >> returned to the user. This Row Repair read MAY detect differences; if it
> >> does, it resolves the super set, sends the delta to the replicas, and
> >> returns the super set value to the client.
> >>
> >>> In this case, for read R1, the value val2 does not have a quorum. Would
> >>> read R1 return val2 or val4?
> >>
> >> If val4 is in the memtable on node before the second read the result
> >> will be val4.
> >> Writes that happen between the initial read and the second read after a
> >> Digest Mismatch are included in the read result.
> >
> >Thanks for clarifying this, Aaron. This is very much in line with what I
> >figured out from the code and brings me back to my initial question of
> >when and what the user/client gets to see as the read result. Let
> >us, for now, consider only the repairs initiated as a part of /consistent
> >reads/. If the Row Repair (after resolving and sending the deltas to
> >replicas, but not waiting for a quorum success after the repair) returns
> >the super set value immediately to the user, wouldn't it be a breach of
> >the consistent reads paradigm? My intuition is that we would be
> >responding to the client without the replicas having confirmed that they
> >meet the consistency requirement.
> >
> >I agree that returning val4 is the right thing to do if a quorum (two) of
> >the nodes (node1, node2, node3) have val4 at the second read after the
> >digest mismatch. But wouldn't it be incorrect to respond to the user with
> >any value when the second read (after the mismatch) doesn't find a
> >quorum? So after sending the deltas to the replicas as a part of the
> >repair (still a part of /consistent reads/), shouldn't the value be read
> >again to check for the presence of a quorum after the repair?
> >
> >In the example we had, assume the mismatch is detected during a read R1
> >from coordinator node C that reaches node1 and node2.
> >State seen by C after first read R1:
> ><node1 = val1, node2 = val2, node3 = val1>
> >
> >A second read is initiated as a part of repair for consistent read of R1.
> >This second read observes the values (val1, val2) from (node1, node2) and
> >sends the corresponding row repair delta to node1. I'm guessing C cannot
> >respond back to the user with val2 until C knows that node1 has actually
> >written the value val2, thereby meeting the quorum. Is this interpretation
> >correct?
> >
> >
> >
> >
> >
> >
> >--
> >View this message in context:
> >http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583395.html
> >Sent from the cassandra-user@incubator.apache.org mailing list archive at
> >Nabble.com.
>
>


--047d7b343a3662bfd404cce4d289--