couchdb-dev mailing list archives

From Robert Samuel Newson <rnew...@apache.org>
Subject Re: Could CouchDB 2.0 fix actual read quorum?
Date Tue, 31 Mar 2015 18:47:11 GMT

It’s testament to my friendship with Mike that we can disagree on such things and remain
friends. I am sorry he misled you, though.

CouchDB 2.0 (like Cloudant) does not have read or write quorums at all, at least in the formal
sense, which is the only sense that matters; "quorum" is unfortunately sloppy language in too
many places to correct.

The r= and w= parameters control only how many of the n possible responses are collected before
an HTTP response is returned.
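
For illustration, a minimal client-side sketch (Python with the requests library; the URL,
database name, and document are placeholders, not from this thread):

    import requests  # pip install requests

    DB = "http://localhost:5984/mydb"  # hypothetical clustered (fabric) endpoint

    # The PUT attempts all n copies regardless; w=2 only means "answer once
    # 2 responses have been collected". 201 = w acknowledgements arrived in
    # time, 202 = fewer did (but at least one write succeeded).
    put = requests.put(DB + "/mydoc", json={"value": 42}, params={"w": 2})
    print(put.status_code)

    # A GET likewise waits for r of the n possible responses before answering.
    get = requests.get(DB + "/mydoc", params={"r": 2})
    print(get.status_code)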

It’s not true that returning 202 in the situation where one write succeeded but fewer than
'w' writes were acknowledged means we’ve chosen availability over consistency, since even if
we returned a 500 or closed the connection without responding, a subsequent GET could return
the document (a probability that increases over time as anti-entropy creates the missing
copies). A write attempt that returned a 409 could, likewise, introduce a new edit branch into
the document, which might then 'win', altering the results of a subsequent GET.
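
To make that concrete, a hedged sketch of the client's dilemma (same hypothetical setup as
the earlier snippet):

    import requests

    DB = "http://localhost:5984/mydb"  # hypothetical endpoint

    put = requests.put(DB + "/otherdoc", json={"v": 1}, params={"w": 3})
    if put.status_code == 202:
        # Fewer than w acknowledgements arrived, but at least one copy was
        # written. Treating this as a hard failure would be wrong: a later
        # read may still return the document once anti-entropy has created
        # the missing copies.
        later = requests.get(DB + "/otherdoc")
        print(later.status_code)  # can be 200 despite the earlier 202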

The essential thing to remember is this: the 'n' copies of your data are completely independent
when written/read by the clustered layer (fabric). It is internal replication (anti-entropy)
that converges those copies, pair-wise, to the same eventual state. Fabric converts the n
independent results into a single result as best it can. Older versions did not expose the
201 vs 202 distinction, calling both of them 201. I do agree with you that there’s little
value in the 202 distinction. About the only thing you could do is investigate your cluster
for connectivity issues or overloading if you get a sustained period of 202s, as it would
be an indicator that the system is partitioned.
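
A rough sketch of that sort of check (the probe documents, threshold, and sample size are
invented for illustration):

    import uuid
    import requests

    DB = "http://localhost:5984/mydb"  # hypothetical endpoint

    def partial_write_rate(n_probes=10, w=2):
        """Fraction of probe writes acknowledged with 202.  A sustained
        high rate suggests a partitioned or overloaded cluster; the
        writes themselves have not failed."""
        partial = 0
        for i in range(n_probes):
            doc_id = f"probe-{uuid.uuid4()}"  # fresh id avoids 409 conflicts
            resp = requests.put(f"{DB}/{doc_id}", json={"i": i}, params={"w": w})
            if resp.status_code == 202:
                partial += 1
        return partial / n_probes

    if partial_write_rate() > 0.5:  # arbitrary threshold
        print("sustained 202s: check cluster connectivity and load")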

In order to achieve your goals, CouchDB 2.0 would have to ensure that the result of a write
did not change after the fact. That is, anti-entropy would need to be disabled, or somehow
agree to roll forward or backward based on the initial circumstances. In short, we’d have
to introduce strong consistency (Paxos, Raft, or ZAB, say). While this would be a great feature
to add, it’s not currently present, and no amount of twiddling the status codes will achieve
it. We’d rather be honest about our position on the CAP triangle.

B.


> On 30 Mar 2015, at 22:37, Nathan Vander Wilt <nate-lists@calftrail.com> wrote:
> 
> A technical co-founder of Cloudant agreed that this was a bug when I first hit it a few
> years ago. I found the original thread; this is the discussion I was trying to recall in
> my OP: 
> It sounds like perhaps there is a related issue tracked internally at Cloudant as a result
> of that conversation.
> 
> JamesM, thanks for your support here and for tracking this down. 203 seemed to me like the
> best status code to "steal" for this, too. Best wishes in getting this fixed!
> 
> regards,
> -natevw
> 
> 
> On Mar 25, 2015, at 4:49 AM, Robert Newson <rnewson@apache.org> wrote:
> 
>> 2.0 is explicitly an AP system; the behaviour you describe is not classified as a bug. 
>> 
>> Anti-entropy is the main reason that you cannot get strong consistency from the system:
>> it will transform "failed" writes (those that succeeded on at least one node but fewer
>> than R nodes) into success (N copies) as long as the nodes have enough healthy uptime. 
>> 
>> True of cloudant and 2.0. 
>> 
>> Sent from my iPhone
>> 
>>> On 24 Mar 2015, at 15:14, Mutton, James <jmutton@akamai.com> wrote:
>>> 
>>> Funny you should mention it.  I drafted an email in early February to queue up the same
>>> discussion whenever I could get involved again (which I promptly forgot about).  What
>>> happens currently in 2.0 appears unchanged from earlier versions.  When R is not satisfied
>>> in fabric, fabric_doc_open:handle_message eventually responds with a {stop, …} but leaves
>>> the acc state as the original r_not_met, which triggers a read_repair from the response
>>> handler.  read_repair results in an {ok, …} with the only doc available, because no other
>>> docs are in the list.  The final doc returned to chttpd_db:couch_doc_open, and thusly to
>>> chttpd_db:db_doc_req, is simply {ok, Doc}, which has now lost the fact that the answer
>>> was not complete.
>>> 
>>> This seems straightforward to fix by a change in fabric_doc_open:handle_response and
>>> read_repair.  handle_response knows whether it has R met and could pass that forward, or
>>> allow read_repair to pass it forward if read_repair is able to satisfy acc.r.  I can’t
>>> speak for community interest in the behavior of sending a 202, but it’s something I’d
>>> definitely like for the same reasons you cite.  Plus it just seems disconnected to do it
>>> on writes but not reads.
>>> 
>>> Cheers,
>>> </JamesM>
>>> 
>>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <nate-lists@calftrail.com> wrote:
>>>> 
>>>> Sorry, I have not been following the CouchDB 2.0 roadmap, but I was extending my
>>>> fermata-couchdb plugin today and realized that perhaps the Apache release of BigCouch
>>>> as CouchDB 2.0 might provide an opportunity to fix a serious issue I had using
>>>> Cloudant's implementation.
>>>> 
>>>> See https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518 for some
>>>> additional background/explanation, but my understanding is that Cloudant for all
>>>> practical purposes ignores the read durability parameter. So you can write with ?w=N to
>>>> attempt some level of quorum, and get a 202 back if that quorum is unmet. _However_,
>>>> when you ?r=N it really doesn't matter if only <N nodes are available…if even just a
>>>> single available node has some version of the requested document you will get a
>>>> successful response (!).
>>>> 
>>>> So in practice, there's no way to actually use the quasi-Dynamo features to dynamically
>>>> _choose_ between consistency and availability: when it comes time to read back a
>>>> consistent result, BigCouch instead just always gives you availability* regardless of
>>>> what a given request actually needs. (In my usage I ended up treating a 202 write as a
>>>> 500, rather than proceeding with no way of ever knowing whether a write did NOT ACTUALLY
>>>> conflict or just hadn't YET because $who_knows_how_many nodes were still down…)
>>>> 
>>>> IIRC, this was both confirmed and acknowledged as a serious bug by a Cloudant engineer
>>>> (or support personnel at least) but could not be quickly fixed, as it could introduce
>>>> backwards-compatibility concerns. So…
>>>> 
>>>> Is CouchDB 2.0 already breaking backwards compatibility with BigCouch? If so, could
>>>> this read durability issue now be fixed during the merge?
>>>> 
>>>> thanks,
>>>> -natevw
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> * DISCLAIMER: this statement has not been endorsed by actual uptime of *any* Couch fork…
>>> 
> 

