couchdb-dev mailing list archives

From "Robert Newson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (COUCHDB-2310) Add a bulk API for revs & open_revs
Date Thu, 18 Dec 2014 23:01:13 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252483#comment-14252483
] 

Robert Newson commented on COUCHDB-2310:
----------------------------------------

A couple of comments (and this is getting off-topic):

1) The proprietary extensions in couchdb-related projects can be an excellent guide for couchdb
itself, but we're not beholden to them. In fact, I said we're obliged to consider the "for
the ages" aspect when considering incorporating them.
2) Agree that enhancing _all_docs is tricky given the poor request handling of the past (specifically,
not returning a 400 Bad Request when given unexpected input).
3) Adding a new feature in 1.7 that we don't necessarily intend to keep in 2.0 is a terrible
idea, even if marked experimental.
4) _bulk_get remains a poor name. _bulk_revs is better, even if the code is the same as _bulk_get.
5) Whether a 1.7 release happens is controversial. I think it should not happen; it's a significant
effort and slows the 2.0 release even further.

In summary, I suggest we add _bulk_revs with the rcouch code, assuming it passes muster (formatting,
tests, etc.) on couchdb standards. It should be added to the couchdb-couch master and couchdb-chttpd,
with a backport to a 1.x branch of top-level couchdb if (and only if) someone is prepared
to make 1.7 happen (my 5 implies that I will not exert personal effort to make that happen,
but I'm not going to stop others if they wish to spend their time, unless they would otherwise
have exerted effort to make the important 2.0 release happen sooner).

summary of summary: new feature work occurs on master, backported if appropriate.


> Add a bulk API for revs & open_revs
> -----------------------------------
>
>                 Key: COUCHDB-2310
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2310
>             Project: CouchDB
>          Issue Type: Bug
>      Security Level: public(Regular issues) 
>          Components: HTTP Interface
>            Reporter: Nolan Lawson
>
> CouchDB replication is too slow.
> And what makes it so slow is that it's just so unnecessarily chatty. During replication,
you have to do a separate GET for each individual document, in order to get the full {{_revisions}}
object for that document (using the {{revs}} and {{open_revs}} parameters; refer
to [the TouchDB writeup|https://github.com/couchbaselabs/TouchDB-iOS/wiki/Replication-Algorithm]
or [Benoit's writeup|http://dataprotocols.org/couchdb-replication/] if you need a refresher).
> So for example, let's say you've got a database full of 10,000 documents, and you replicate
using a batch size of 500 (batch sizes are configurable in PouchDB). The conversation for
a single batch basically looks like this:
> {code}
> - REPLICATOR: gimme 500 changes since seq X (1 GET request)
>   - SOURCE: okay
> - REPLICATOR: gimme the _revs_diff for these 500 docs/_revs (1 POST request)
>   - SOURCE: okay
> - repeat 500 times:
>   - REPLICATOR: gimme the _revisions for doc n with _revs [...] (1 GET request)
>     - SOURCE: okay
> - REPLICATOR: here's a _bulk_docs with 500 documents (1 POST request)
>     - TARGET: okay
> {code}
> See the problem here? That 500-loop, where we have to do a GET for each one of 500 documents,
is a lot of unnecessary back-and-forth, considering that the replicator already knows what
it needs before the loop starts. You can parallelize, but if you assume a browser (e.g. for
PouchDB), most browsers only allow ~8 simultaneous requests. Plus, there's latency
and HTTP headers to consider. So overall, it's not cool.
> So why do we even need to do the separate requests? Shouldn't {{_all_docs}} be good enough?
Turns out it's not, because we need this special {{_revisions}} object.
> For example, consider a document {{'foo'}} with 10 revisions. You may compact the database,
in which case revisions {{1-x}} through {{9-x}} are no longer retrievable. However, if you
query using {{revs}} and {{open_revs}}, those rev IDs are still available:
> {code}
> $ curl 'http://nolan.iriscouch.com/test/foo?revs=true&open_revs=all'
> {
>   "_id": "foo",
>   "_rev": "10-c78e199ad5e996b240c9d6482907088e",
>   "_revisions": {
>     "start": 10,
>     "ids": [
>       "c78e199ad5e996b240c9d6482907088e",
>       "f560283f1968a05046f0c38e468006bb",
>       "0091198554171c632c27c8342ddec5af",
>       "e0a023e2ea59db73f812ad773ea08b17",
>       "65d7f8b8206a244035edd9f252f206ad",
>       "069d1432a003c58bdd23f01ff80b718f",
>       "d21f26bb604b7fe9eba03ce4562cf37b",
>       "31d380f99a6e54875855e1c24469622d",
>       "3b4791360024426eadafe31542a2c34b",
>       "967a00dff5e02add41819138abb3284d"
>     ]
>   }
> }
> {code}
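To make the {{_revisions}} object above concrete: each entry in {{ids}} is a revision hash, listed newest first, and {{start}} is the generation of the first entry. A minimal Python sketch (the function name `expand_revisions` is made up for illustration) that turns the object into full rev IDs:

```python
def expand_revisions(revisions):
    """Turn a _revisions object into full rev IDs, newest first.

    The generation number counts down from `start` as we walk the `ids` list.
    """
    start = revisions["start"]
    return [f"{start - i}-{rev_hash}" for i, rev_hash in enumerate(revisions["ids"])]


# The _revisions object from the curl output above.
revisions = {
    "start": 10,
    "ids": [
        "c78e199ad5e996b240c9d6482907088e",
        "f560283f1968a05046f0c38e468006bb",
        "0091198554171c632c27c8342ddec5af",
        "e0a023e2ea59db73f812ad773ea08b17",
        "65d7f8b8206a244035edd9f252f206ad",
        "069d1432a003c58bdd23f01ff80b718f",
        "d21f26bb604b7fe9eba03ce4562cf37b",
        "31d380f99a6e54875855e1c24469622d",
        "3b4791360024426eadafe31542a2c34b",
        "967a00dff5e02add41819138abb3284d",
    ],
}

rev_ids = expand_revisions(revisions)
print(rev_ids[0])   # 10-c78e199ad5e996b240c9d6482907088e
print(rev_ids[-1])  # 1-967a00dff5e02add41819138abb3284d
```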
> And in the replication algorithm, _this full \_revisions object is required_ at the point
when you copy the document from one database to another, which is accomplished with a POST
to {{_bulk_docs}} using {{new_edits=false}}. If you don't have the full {{_revisions}} object,
CouchDB accepts the new revision, but considers it to be a conflict. (The exception is with
generation-1 documents, since they have no history, so as it says in the TouchDB writeup,
you can safely just use {{_all_docs}} as an optimization for such documents.)
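The {{_bulk_docs}} step described above can be sketched roughly as follows. This only builds the JSON body; it does not send anything. The helper names and the sample documents are made up for illustration, and the check simply mirrors the rule in the text: anything past generation 1 should carry its {{_revisions}} history.

```python
import json


def generation(rev):
    """Return the generation number from a rev ID such as '10-c78e...'."""
    return int(rev.split("-", 1)[0])


def build_bulk_docs_body(docs):
    """Build a JSON body for POST /{db}/_bulk_docs with new_edits=false.

    Refuses documents beyond generation 1 that lack _revisions, since the
    target would otherwise treat the pushed revision as a conflict.
    """
    for doc in docs:
        if generation(doc["_rev"]) > 1 and "_revisions" not in doc:
            raise ValueError(f"doc {doc['_id']!r} is missing its _revisions history")
    return json.dumps({"docs": docs, "new_edits": False})


# Made-up sample documents: one with history, one generation-1 document.
doc_with_history = {
    "_id": "foo",
    "_rev": "2-bbb",
    "_revisions": {"start": 2, "ids": ["bbb", "aaa"]},
}
first_gen_doc = {"_id": "baz", "_rev": "1-ccc"}

body = build_bulk_docs_body([doc_with_history, first_gen_doc])
print(body)
```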
> And unfortunately, this {{_revisions}} object is only available from the {{GET /:dbid/:docid}}
endpoint. Trust me; I've tried the other APIs. You can't get it anywhere else.
> This is a huge problem, especially in PouchDB where we often have to deal with CORS,
meaning the number of HTTP requests is doubled. So for those 500 GETs, it's an extra 500 OPTIONS requests,
which is just unacceptable.
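For a rough sense of scale, the request counts from the conversation above can be tallied in a few lines of Python. This is a back-of-the-envelope sketch: the function name is made up, and the CORS case assumes every request is cross-origin and preceded by its own OPTIONS preflight, which is the "doubled" figure used in the text.

```python
def requests_per_batch(batch_size, cors=False):
    """Count HTTP requests for one replication batch, per the outline above."""
    changes = 1            # GET _changes
    revs_diff = 1          # POST _revs_diff
    doc_gets = batch_size  # one GET per document for revs/open_revs
    bulk_docs = 1          # POST _bulk_docs
    total = changes + revs_diff + doc_gets + bulk_docs
    # Assumption: with CORS, each request costs one extra OPTIONS preflight.
    return total * 2 if cors else total


batch_size = 500
total_docs = 10_000
batches = total_docs // batch_size

print(requests_per_batch(batch_size))                       # 503
print(requests_per_batch(batch_size, cors=True))            # 1006
print(batches * requests_per_batch(batch_size))             # 10060
print(batches * requests_per_batch(batch_size, cors=True))  # 20120
```

With a bulk endpoint, the `doc_gets` term would drop from `batch_size` to 1 per batch.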
> Replication does not have to be slow. While we were experimenting with ways of fetching
documents in bulk, we tried a technique that just relied on using {{_changes}} with {{include_docs=true}}
([#2472|https://github.com/pouchdb/pouchdb/pull/2472]). This pushed conflicts into the target
database, but on the upside, you can sync ~95k documents from npm's skimdb repository to the
browser in less than 20 minutes! (See [npm-browser.com|http://npm-browser.com] for a demo.)
> What an amazing story we could tell about the beauty of CouchDB replication, if only
this trick actually worked!
> My proposal is a simple one: just add the {{revs}} and {{open_revs}} options to {{_all_docs}}.
Presumably this would be aligned with {{keys}}, so similar to how {{keys}} takes an array
of docIds, {{open_revs}} would take an array of arrays of revisions. {{revs}} would just be
a boolean.
> This only gets hairy in the case of deleted documents. In this example, {{bar}} is deleted
but {{foo}} is not:
> {code}
> curl -g 'http://nolan.iriscouch.com/test/_all_docs?keys=["bar","foo"]&include_docs=true'
> {"total_rows":1,"offset":0,"rows":[
> {"id":"bar","key":"bar","value":{"rev":"2-eec205a9d413992850a6e32678485900","deleted":true},"doc":null},
> {"id":"foo","key":"foo","value":{"rev":"10-c78e199ad5e996b240c9d6482907088e"},"doc":{"_id":"foo","_rev":"10-c78e199ad5e996b240c9d6482907088e"}}
> ]}
> {code}
> The cleanest would be to attach the {{_revisions}} object to the {{doc}}, but if you
use {{keys}}, then the deleted documents are returned with {{doc: null}}, even if you specify
{{include_docs=true}}. One workaround would be to simply add a {{revisions}} object to the
{{value}}.
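The request shape proposed above, with {{open_revs}} aligned index-by-index with {{keys}}, might look something like the following Python sketch. This is purely hypothetical: it only illustrates the proposal as worded here, the function name is made up, and nothing in this thread says any CouchDB release accepts these parameters on {{_all_docs}}.

```python
def build_proposed_request(wanted):
    """Build the hypothetical parameters for the proposed _all_docs extension.

    `wanted` maps each doc ID to the list of rev IDs to fetch. `open_revs`
    is kept aligned with `keys`: open_revs[i] belongs to keys[i].
    """
    keys = list(wanted)
    return {
        "keys": keys,
        "open_revs": [wanted[doc_id] for doc_id in keys],
        "revs": True,  # ask for the _revisions history as well
    }


# Doc IDs and rev IDs taken from the _all_docs example above.
request = build_proposed_request({
    "bar": ["2-eec205a9d413992850a6e32678485900"],
    "foo": ["10-c78e199ad5e996b240c9d6482907088e"],
})
print(request["keys"])       # ['bar', 'foo']
print(request["open_revs"])
```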
> If all of this would be too difficult to implement under the hood in CouchDB, I'd also
be happy to get the {{_revisions}} back in {{_changes}}, {{_revs_diff}}, or even in a separate
endpoint. I don't care, as long as there is some bulk API where I can get multiple {{_revisions}}
for multiple documents at once.
> On the PouchDB end of things, we would really like to push forward on this. I'm happy
to implement a Node.js proxy to stand in front of CouchDB/Cloudant/CSG and add this new API,
plus adding it directly to PouchDB Server. I can invent whatever API I want, but the main
thing is that I would like this API to be something that all the major players can agree upon
(Apache, Cloudant, Couchbase) so that eventually the proxy is no longer necessary.
> Thanks for reading the WoT. Looking forward to a faster CouchDB replication protocol,
since it's the thing that ties us all together and makes this crazy experiment worthwhile.
> Background: [this|https://github.com/pouchdb/pouchdb/issues/2686] and [this|https://gist.github.com/nolanlawson/340cb898f8ed9f3db8a0].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
