couchdb-dev mailing list archives

From Robert Samuel Newson <rnew...@apache.org>
Subject Re: Could CouchDB 2.0 fix actual read quorum?
Date Thu, 02 Apr 2015 08:22:29 GMT

Paul outlined his previous efforts to introduce this indication, and the problems he faced
doing so. Can we come up with an acceptable mechanism?

A different status code will break a lot of users. While the HTTP spec says you can treat
any 2xx code as success, plenty of libraries, etc., only recognise 201/202 as a successful
write and 200 (and maybe 204, 206) for reads.

My preference is for a change that "can’t" break anyone, which I think only leaves an "X-CouchDB-R-Met: 2" response header, which isn’t the most pleasant thing.

Suggestions?

B.


> On 1 Apr 2015, at 06:55, Mutton, James <jmutton@akamai.com> wrote:
> 
> For at least my part of it, I agree with Adam. Bigcouch has made an effort to inform in the case of a failure to apply W. I've seen it lead to confusion when the same logic was not applied on R.
> 
> I also agree that W and R are not binding contracts. There's no agreement protocol to assure that W is met before being committed to disk. But they are exposed as a blocking parameter of the request, so notification being consistent appeared to me to be the best compromise (vs straight up removal).
> 
> </JamesM>
> 
> 
>> On Mar 31, 2015, at 13:15, Robert Newson <rnewson@apache.org> wrote:
>> 
>> 
>> If a way can be found that doesn't break things and that can be sent in all or most cases, sure. It's what a user can really infer from that which I focused on. Not as much, I think, as users that want that info really want.
>> 
>> 
>>> On 31 Mar 2015, at 21:08, Adam Kocoloski <kocolosk@apache.org> wrote:
>>> 
>>> I hope we can all agree that CouchDB should inform the user when it is unable to satisfy the requested read "quorum".
>>> 
>>> Adam
>>> 
>>>> On Mar 31, 2015, at 3:20 PM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
>>>> 
>>>> Sounds like there's a bit of confusion here.
>>>> 
>>>> What Nathan is asking for is the ability to have Couch respond with some
>>>> information on the actual number of replicas that responded to a read
>>>> request. That way a user could tell that they issued an r=2 request when
>>>> only r=1 was actually performed. Depending on your point of view in an MVCC
>>>> world this is either a bug or a feature. :)
>>>> 
>>>> It was generally agreed upon that if we could return this information it
>>>> would be beneficial. However, what happened when I started implementing
>>>> this patch was that we could either return it in only a subset of the cases
>>>> where it happens, return it inconsistently between various responses, or
>>>> break replication.
>>>> 
>>>> There were three general methods for this. The first was to include a new
>>>> "_r_met" key in the doc body: a boolean indicating whether the requested
>>>> read quorum was actually met for the document. The second was to return a
>>>> custom X-R-Met type header, and the last was the status code as described.
>>>> 
>>>> The _r_met member was thought to be the best, but unfortunately that breaks
>>>> replication with older clients because we throw an error rather than ignore
>>>> any unknown underscore-prefixed field name. Thus, having something that was
>>>> just dynamically injected into the document body was a non-starter.
>>>> Unfortunately, if we don't inject into the document body then we limit
>>>> ourselves to only the set of APIs where a single document is returned. This
>>>> is due both to streaming semantics (we can't buffer an entire response in
>>>> memory for large requests to _all_docs) and to multi-doc responses (a
>>>> single boolean doesn't say which document may not have had a properly met
>>>> R).
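>>>> 
>>>> To illustrate why that first option died, a hypothetical sketch in Python terms (CouchDB never shipped an "_r_met" field; the doc below is invented):
>>>> 
>>>>     # What a single-doc GET with ?r=2 might have returned under the _r_met proposal:
>>>>     doc = {
>>>>         "_id": "mydoc",
>>>>         "_rev": "1-967a00dff5e02add41819138abb3284d",
>>>>         "_r_met": False,  # injected by the coordinator, never stored
>>>>         "value": 42,
>>>>     }
>>>>     # An older client replicating this body throws an error on the unknown
>>>>     # underscore-prefixed "_r_met" field rather than ignoring it, which is
>>>>     # exactly the replication breakage described above.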
>>>> 
>>>> On top of that, the other confusing part of meeting the read quorum is
>>>> that, given MVCC semantics, it is unclear how you respond to documents with
>>>> different revision histories. For instance, if we read two docs, we have
>>>> technically met the r=2 requirement, but what should our response be if
>>>> those two revisions are different? (Technically, in this case we wait for
>>>> the third response, but the decision on what to return for the "r met"
>>>> value is still unclear.)
>>>> 
>>>> While I think everyone is in agreement that it'd be nice to return some of
>>>> the information about the copies read, I think it's much less clear what
>>>> should be returned, and how, in the multitude of cases where we can specify
>>>> a value for R.
>>>> 
>>>> While that doesn't offer a concrete path forward, hopefully it clarifies
>>>> some of the issues at hand.
>>>> 
>>>> On Tue, Mar 31, 2015 at 1:47 PM, Robert Samuel Newson <rnewson@apache.org> wrote:
>>>> 
>>>>> 
>>>>> It’s testament to my friendship with Mike that we can disagree on such
>>>>> things and remain friends. I am sorry he misled you, though.
>>>>> 
>>>>> CouchDB 2.0 (like Cloudant) does not have read or write quorums at all, at least in the formal sense (the only one that matters); this is unfortunately sloppy language in too many places to correct.
>>>>> 
>>>>> The r= and w= parameters control only how many of the n possible responses
>>>>> are collected before returning an http response.
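>>>>> 
>>>>> For example, a minimal sketch in Python (the cluster URL and document are placeholders; the status codes reflect the behaviour described in this thread):
>>>>> 
>>>>>     import requests
>>>>> 
>>>>>     base = "http://localhost:5984/db"
>>>>> 
>>>>>     # Wait until 2 of the n copies acknowledge before the HTTP response returns.
>>>>>     put = requests.put(base + "/doc1", json={"state": "new"}, params={"w": "2"})
>>>>>     print(put.status_code)  # 201 if two copies answered in time, 202 otherwise
>>>>> 
>>>>>     # Collect 2 of the n copies before answering the read.
>>>>>     get = requests.get(base + "/doc1", params={"r": "2"})
>>>>>     print(get.status_code)  # currently 200 even if only one copy could answer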
>>>>> 
>>>>> It’s not true that returning 202 in the situation where one write is made but fewer than 'r' writes are made means we’ve chosen availability over consistency, since even if we returned a 500 or closed the connection without responding, a subsequent GET could return the document (a probability that increases over time as anti-entropy creates the missing copies). A write attempt that returned a 409 could, likewise, introduce a new edit branch into the document, which might then 'win', altering the results of a subsequent GET.
>>>>> 
>>>>> The essential thing to remember is this: the ’n’ copies of your data are completely independent when written/read by the clustered layer (fabric). It is internal replication (anti-entropy) that converges those copies, pair-wise, to the same eventual state. Fabric is converting the 3 independent results into a single result as best it can. Older versions did not expose the 201 vs 202 distinction, calling both of them 201. I do agree with you that there’s little value in the 202 distinction. About the only thing you could do is investigate your cluster for connectivity issues or overloading if you get a sustained period of 202’s, as it would be an indicator that the system is partitioned.
>>>>> 
>>>>> In order to achieve your goals, CouchDB 2.0 would have to ensure that the result of a write did not change after the fact. That is, anti-entropy would need to be disabled, or somehow agree to roll forward or backward based on the initial circumstances. In short, we’d have to introduce strong consistency (paxos or raft or zab, say). While this would be a great feature to add, it’s not currently present, and no amount of twiddling the status codes will achieve it. We’d rather be honest about our position on the CAP triangle.
>>>>> 
>>>>> B.
>>>>> 
>>>>> 
>>>>>> On 30 Mar 2015, at 22:37, Nathan Vander Wilt <nate-lists@calftrail.com> wrote:
>>>>>> 
>>>>>> A technical co-founder of Cloudant agreed that this was a bug when I first hit it a few years ago. I found the original thread here — this is the discussion I was trying to recall in my OP:
>>>>>> It sounds like perhaps there is a related issue tracked internally at Cloudant as a result of that conversation.
>>>>>> 
>>>>>> JamesM, thanks for your support here and tracking this down. 203 seemed like the best status code to "steal" for this to me too. Best wishes in getting this fixed!
>>>>>> 
>>>>>> regards,
>>>>>> -natevw
>>>>>> 
>>>>>> 
>>>>>>> On Mar 25, 2015, at 4:49 AM, Robert Newson <rnewson@apache.org> wrote:
>>>>>>> 
>>>>>>> 2.0 is explicitly an AP system; the behaviour you describe is not classified as a bug.
>>>>>>> 
>>>>>>> Anti-entropy is the main reason that you cannot get strong consistency from the system: it will transform "failed" writes (those that succeeded on one node but fewer than R nodes) into success (N copies) as long as the nodes have enough healthy uptime.
>>>>>>> 
>>>>>>> True of cloudant and 2.0.
>>>>>>> 
>>>>>>> Sent from my iPhone
>>>>>>> 
>>>>>>>> On 24 Mar 2015, at 15:14, Mutton, James <jmutton@akamai.com> wrote:
>>>>>>>> 
>>>>>>>> Funny you should mention it.  I drafted an email in early February to queue up the same discussion whenever I could get involved again (which I promptly forgot about).  What happens currently in 2.0 appears unchanged from earlier versions.  When R is not satisfied in fabric, fabric_doc_open:handle_message eventually responds with a {stop, …} but leaves the acc-state as the original r_not_met, which triggers a read_repair from the response handler.  read_repair results in an {ok, …} with the only doc available, because no other docs are in the list.  The final doc returned to chttpd_db:couch_doc_open and thusly to chttpd_db:db_doc_req is simply {ok, Doc}, which has now lost the fact that the answer was not complete.
>>>>>>>> 
>>>>>>>> This seems straightforward to fix by a change in fabric_doc_open:handle_response and read_repair.  handle_response knows whether it has R met and could pass that forward, or allow read_repair to pass it forward if read_repair is able to satisfy acc.r.  I can’t speak for community interest in the behavior of sending a 202, but it’s something I’d definitely like for the same reasons you cite.  Plus it just seems disconnected to do it on writes but not reads.
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> </JamesM>
>>>>>>>> 
>>>>>>>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <nate-lists@calftrail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Sorry, I have not been following the CouchDB 2.0 roadmap, but I was extending my fermata-couchdb plugin today and realized that perhaps the Apache release of BigCouch as CouchDB 2.0 might provide an opportunity to fix a serious issue I had using Cloudant's implementation.
>>>>>>>>> 
>>>>>>>>> See https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518 for some additional background/explanation, but my understanding is that Cloudant for all practical purposes ignores the read durability parameter. So you can write with ?w=N to attempt some level of quorum, and get a 202 back if that quorum is unmet. _However_, when you read with ?r=N it really doesn't matter if only <N nodes are available… if even just a single available node has some version of the requested document you will get a successful response (!).
>>>>>>>>> 
>>>>>>>>> So in practice, there's no way to actually use the quasi-Dynamo features to dynamically _choose_ between consistency or availability — when it comes time to read back a consistent result, BigCouch instead just always gives you availability* regardless of what a given request actually needs. (In my usage I ended up treating a 202 write as a 500, rather than proceeding with no way of ever knowing whether a write did NOT ACTUALLY conflict or just hadn't YET because $who_knows_how_many nodes were still down…)
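>>>>>>>>> 
>>>>>>>>> A minimal sketch of that 202-as-failure idea, in Python here rather than my actual fermata-couchdb code (the URL and document are placeholders):
>>>>>>>>> 
>>>>>>>>>     import requests
>>>>>>>>> 
>>>>>>>>>     resp = requests.put("http://localhost:5984/db/doc1", json={"v": 1}, params={"w": "2"})
>>>>>>>>>     if resp.status_code == 202:
>>>>>>>>>         # fewer than w copies acknowledged; fail loudly instead of carrying on
>>>>>>>>>         raise RuntimeError("write quorum not met")
>>>>>>>>>     resp.raise_for_status()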
>>>>>>>>> 
>>>>>>>>> IIRC, this was both confirmed and acknowledged as a serious bug by a Cloudant engineer (or support personnel at least) but could not be quickly fixed as it could introduce backwards-compatibility concerns. So…
>>>>>>>>> 
>>>>>>>>> Is CouchDB 2.0 already breaking backwards compatibility with BigCouch? If so, could this read durability issue now be fixed during the merge?
>>>>>>>>> 
>>>>>>>>> thanks,
>>>>>>>>> -natevw
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> * DISCLAIMER: this statement has not been endorsed by actual uptime of *any* Couch fork…
>>> 

