couchdb-replication mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: _bulk_get protocol extension
Date Fri, 24 Jan 2014 18:15:25 GMT
The replicator knows how to pipeline, yes, but the server considers each pipelined request
separately, whereas theoretically with a _bulk_get it could batch the btree operations as
well (and eliminate some redundant inner node lookups).  I'm also not sure how efficient the
ibrowse client's implementation of pipelining is in the end -- I've not observed the kinds
of speedups that I'd expect with it.

Jens, I believe Apache CouchDB already does handle nested multipart on the PUT side (multiple
revisions of a document with attachments), but I'll admit the code to do so is rather difficult
to grok and could really benefit from a refactor, which would certainly enable a nice implementation
of something like _bulk_get.

Adam

On Jan 24, 2014, at 12:35 PM, Yaron Goland <yarong@microsoft.com> wrote:

> In the HTTP WG more than a decade ago issues like this came up under the name 'boxcar'ing'.
> But with the introduction of pipelining the performance benefits of boxcar'ing for idempotent
> requests went away.
> 
> In a replication the source should be able to fire off GET requests down the pipeline
> non-stop and the remote server should be able to return them just as quickly. So have you
> identified why you are seeing bad performance?
> 
> 	Thanks,
> 
> 			Yaron
> 
>> -----Original Message-----
>> From: Jens Alfke [mailto:jens@couchbase.com]
>> Sent: Friday, January 24, 2014 7:21 AM
>> To: replication@couchdb.apache.org
>> Subject: _bulk_get protocol extension
>> 
>> (I'm excited about this list! There have been some topics I've wanted to bring
>> up that are too implementation-oriented for the user@ list, but I haven't
>> been brave enough to dive into the dev@ list because I don't know Erlang or
>> the internals of CouchDB. I also really appreciate folks sharing the viewpoint
>> that CouchDB is an ecosystem and an open replication protocol, not just a
>> particular database implementation.)
>> 
>> Anyway. One topic I'd like to bring up is that, in my non-scientific
>> observations, the major performance bottleneck in pull replications is the
>> fact that revisions have to be transferred using individual GET requests. I've
>> seen very poor performance when pulling lots of small documents from a
>> distant server, like an order of magnitude below the throughput of sending a
>> single huge document.
>> 
>> (Yes, it's possible to get multiple revisions at once by POSTing to _all_docs.
>> Unfortunately this has limitations that make it unsuitable for replication; see
>> my explanation at the page linked below.)
>> 
>> A few months ago I experimentally implemented a new "_bulk_get" REST call
>> in Couchbase's replicators (Couchbase Lite and the Sync Gateway), which
>> significantly improves performance by allowing the puller to request any
>> number of revisions in a single HTTP request. Again, no scientific tests or hard
>> numbers, but it was enough to convince me it's worthwhile. I've
>> documented it here:
>> 	https://github.com/couchbase/sync_gateway/wiki/Bulk-GET
>> It's pretty straightforward and I've tried to make it consistent with the
>> standard API. The only unusual thing is that the response can contain nested
>> MIME multipart bodies: the response format is multipart, with every
>> requested revision in a part, but revisions containing attachments are
>> themselves sent as multipart. (This shouldn't be an issue for any decent
>> multipart parser, since nested multipart is pretty common in emails, but I
>> think it's the first time it's happened in the CouchDB API.)
>> 
>> I'd be happy if this were implemented in CouchDB and made an official part
>> of the API. Hopefully the spec I wrote is detailed enough to make that
>> straightforward. (I don't have the Erlang skills to do it myself, though.)
>> 
>> -Jens
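
The nested multipart response described in the original message (one part per requested revision, with revisions that carry attachments sent as multipart themselves) can be walked with an ordinary MIME parser. The following is only a rough sketch using Python's standard `email` package. The sample body, the boundary names, the `multipart/mixed` and `multipart/related` content types, and the rule that the first inner part is the JSON document are all assumptions made for illustration, not details taken from the spec linked above. `iter_revisions` is a hypothetical helper name.

```python
import json
from email import message_from_bytes


def iter_revisions(content_type, body):
    """Yield (document_dict, [attachment_bytes, ...]) per top-level part.

    Assumes each top-level part is either a JSON document or a nested
    multipart whose first sub-part is the JSON document and whose
    remaining sub-parts are attachments (illustrative assumption).
    """
    # The email parser finds the boundary from a Content-Type header,
    # so prepend the HTTP response's Content-Type to the raw body.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = message_from_bytes(raw)
    for part in message.get_payload():
        if part.is_multipart():
            # Revision with attachments: nested multipart body.
            subparts = part.get_payload()
            doc = json.loads(subparts[0].get_payload(decode=True))
            attachments = [p.get_payload(decode=True) for p in subparts[1:]]
        else:
            # Revision without attachments: plain JSON part.
            doc = json.loads(part.get_payload(decode=True))
            attachments = []
        yield doc, attachments


# Made-up response: one plain revision, one revision with an attachment.
SAMPLE_TYPE = 'multipart/mixed; boundary="outer"'
SAMPLE_BODY = (
    b"--outer\r\n"
    b"Content-Type: application/json\r\n"
    b"\r\n"
    b'{"_id": "doc1", "_rev": "1-abc"}\r\n'
    b"--outer\r\n"
    b'Content-Type: multipart/related; boundary="inner"\r\n'
    b"\r\n"
    b"--inner\r\n"
    b"Content-Type: application/json\r\n"
    b"\r\n"
    b'{"_id": "doc2", "_rev": "2-def"}\r\n'
    b"--inner\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"hello\r\n"
    b"--inner--\r\n"
    b"--outer--\r\n"
)

if __name__ == "__main__":
    for doc, attachments in iter_revisions(SAMPLE_TYPE, SAMPLE_BODY):
        print(doc["_id"], doc["_rev"], len(attachments))
```

A real client would stream the parts rather than buffer the whole response, and would match attachments to the document's attachment metadata rather than rely on part order.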

