From: aaron morton <aaron@thelastpickle.com>
Subject: Re: What does ReadRepair exactly do?
Date: Fri, 26 Oct 2012 22:53:56 +1300
To: user@cassandra.apache.org

>> replicas but to ensure we read at least one newest value as long as write
>> quorum succeeded beforehand and W+R > N.

This is correct.
It's not that a quorum of nodes agree; it's that a quorum of nodes participate. If a quorum participates in both the write and the read, you are guaranteed that at least one node was involved in both. The Wikipedia definition helps here: "A quorum is the minimum number of members of a deliberative assembly necessary to conduct the business of that group" http://en.wikipedia.org/wiki/Quorum
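To put numbers on that, here is a throwaway Python sketch of my own (nothing from the C* code base, the node names are made up): with N replicas, any write set and read set of quorum size must share at least one node whenever R + W > N.

    # toy illustration: a quorum write set and a quorum read set must overlap
    N = 3
    QUORUM = N // 2 + 1                     # 2 when N = 3
    write_set = {"node1", "node2"}          # the W nodes that acked the write
    read_set = {"node2", "node3"}           # the R nodes that answer the read
    assert len(write_set) + len(read_set) > N   # R + W > N
    assert write_set & read_set                 # so at least one node saw both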

It's a two-step process: first, do we have enough people to make a decision? Second, following the rules, what was the decision?

In C* the rule is to use the value with the highest timestamp, not the value with the highest number of "votes". The red boxes on this slide are the winning values: http://www.slideshare.net/aaronmorton/cassandra-does-what-code-mania-2012/67 (I think one of my slides in that deck may have been misleading in the past). In Riak the rule is to use Vector Clocks.

So

> I agree that returning val4 is the right thing to do if quorum (two) nodes
> among (node1,node2,node3) have the val4

Is incorrect.

We return the value with the highest timestamp returned from the nodes involved in the read. Only one needs to have val4.
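As a toy illustration (plain Python, my own names, not Cassandra's actual read path): the coordinator resolves by timestamp, so a single replica holding val4 is enough.

    # hypothetical responses from the replicas involved in the read: (value, timestamp)
    responses = [("val1", 1), ("val1", 1), ("val4", 4)]
    value, ts = max(responses, key=lambda r: r[1])   # highest timestamp wins
    print(value)                                     # val4, held by only one replica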

> The heart of the problem
> here is that the coordinator responds to a client request "assuming" that
> the consistency has been achieved the moment it issues a row repair with the
> super-set of the resolved value; without receiving acknowledgement on the
> success of a repair from the replicas for a given consistency constraint.

and

> My intuition behind saying this is because we
> would respond to the client without the replicas having confirmed their
> meeting the consistency requirement.

It is not necessary for the coordinator to wait.

Consider an example: the app has stopped writing to the cluster, and for a certain column nodes 1, 2 and 3 have value:timestamp bar:2, bar:2 and foo:1 respectively. The last write was a successful CL QUORUM write of bar with timestamp 2. However node 3 did not acknowledge this write for some reason.

To make it interesting, the commit log volume on node 3 is full. Mutations are blocking in the commit log queue, so any write on node 3 will time out and fail, but reads are still working. We could imagine this is why node 3 did not commit bar:2.

Some read examples, RR is not = active:

1) Client reads from node 4 (a non replica) with CL QUORUM, request goes to nodes 1 and 2. Both agree on bar as the value.
2) Client reads from node 3 with CL QUORUM, request is processed locally and on node 2.
    * There is a digest mismatch.
    * A Row Repair read runs against nodes 2 and 3.
    * The super set resolves to bar:2.
    * Node 3 (the coordinator) queues a delta write locally to write bar:2. No other delta writes are sent.
    * Node 3 returns bar:2 to the client.
3) Client reads from node 3 at CL QUORUM. The same thing as (2) happens and bar:2 is returned.
4) Client reads from node 2 at CL QUORUM, read goes to nodes 2 and 3. Roughly the same thing as (2) happens and bar:2 is returned.
5) Client reads from node 1 at CL ONE. Read happens locally only and returns bar:2.
6) Client reads from node 3 at CL ONE. Read happens locally only and returns foo:1.

So:
* A read at CL QUORUM will always return bar:2 even if node 3 only has foo:1 on disk.
* A read at CL ONE will return no value or any previous write.
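A tiny Python model of the example above (my own sketch, not Cassandra internals) gives the same answers: quorum reads always resolve to bar:2, and CL ONE depends on which node you hit.

    # replica state from the example: (value, timestamp) per node
    replicas = {"node1": ("bar", 2), "node2": ("bar", 2), "node3": ("foo", 1)}

    def read(contacted):
        # the participating replicas' values are resolved by highest timestamp
        return max((replicas[n] for n in contacted), key=lambda vt: vt[1])

    print(read(["node1", "node2"]))   # QUORUM via nodes 1+2 -> ('bar', 2)
    print(read(["node2", "node3"]))   # QUORUM via nodes 2+3 -> ('bar', 2)
    print(read(["node1"]))            # CL ONE on node 1     -> ('bar', 2)
    print(read(["node3"]))            # CL ONE on node 3     -> ('foo', 1)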

The delta write from the Row Repair goes to a single node, so R + W > N cannot be applied to it. It can almost be thought of as an internal implementation detail. The delta write from a Digest Mismatch, HH writes, full RR writes and nodetool repair are used to:

* Reduce the chance of a Digest Mismatch when CL > ONE.
* Eventually reach a state where reads at any CL return the last write.

They are not used to ensure strong consistency when R + W > N. You could turn those things off and R + W > N would still work.
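If it helps, this is roughly how I picture the Digest Mismatch path from example (2); again, only a hand-rolled Python sketch under my own assumptions (the names and the md5 stand-in are mine, not the actual read path code):

    import hashlib

    # state from the example: node 3 never committed bar:2
    replicas = {"node2": ("bar", 2), "node3": ("foo", 1)}

    def digest(value_ts):
        # stand-in for the digest a replica returns instead of the full row
        return hashlib.md5(repr(value_ts).encode()).hexdigest()

    data = replicas["node2"]                       # full data from one replica
    if digest(data) != digest(replicas["node3"]):  # digests disagree
        # mismatch: do a full data read, resolve by timestamp, and queue a
        # delta write for the stale replica; the client reply does not wait on it
        data = max(replicas.values(), key=lambda vt: vt[1])
    print(data)                                    # ('bar', 2)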
 
Hope that helps.


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 7:15 AM, shankarpnsn <shankarpnsn@gmail.com> wrote:

> manuzhang wrote
>> read quorum doesn't mean we read newest values from a quorum number of
>> replicas but to ensure we read at least one newest value as long as write
>> quorum succeeded beforehand and W+R > N.
>
> I beg to differ here. Any read/write, by definition of quorum, should have
> at least n/2 + 1 replicas that agree on that read/write value. Responding to
> the user with a newer value, even if the write creating the new value hasn't
> completed cannot guarantee any read consistency > 1.
>
> Hiller, Dean wrote
>>> Kind of an interesting question
>>>
>>> I think you are saying if a client read resolved only the two nodes as
>>> said in Aaron's email back to the client and read-repair was kicked off
>>> because of the inconsistent values and the write did not complete yet and
>>> I guess you would have two nodes go down to lose the value right after
>>> the
>>> read, and before write was finished such that the client read a value
>>> that
>>> was never stored in the database. The odds of two nodes going out are
>>> pretty slim though.
>>> Thanks,
>>> Dean
>
> Bingo! I do understand that the odds of a quorum of nodes going down are low
> and that any subsequent read would achieve a quorum. However, I'm wondering
> what would be the right thing to do here, given that the client has
> particularly asked for a certain consistency on the read and cassandra
> returns a value that doesn't have the consistency. The heart of the problem
> here is that the coordinator responds to a client request "assuming" that
> the consistency has been achieved the moment it issues a row repair with the
> super-set of the resolved value; without receiving acknowledgement on the
> success of a repair from the replicas for a given consistency constraint.
>
> In order to adhere to the given consistency specification, the row repair
> (due to consistent reads) should repeat the read after issuing a
> "consistency repair" to ensure the consistency is met. Like Manu
> mentioned, this could of course lead to a number of repeat reads if the
> writes arrive quickly - until the read gets timed out. However, note that we
> would still be honoring the consistency constraint for that read.
>
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583400.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
