couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Shorin <kxe...@gmail.com>
Subject Re: [REVIEW] CouchDB Replication Protocol
Date Tue, 12 Nov 2013 02:35:47 GMT
Hi Jens!

Thanks for looking into! Reply inlined

On Tue, Nov 12, 2013 at 5:05 AM, Jens Alfke <jens@couchbase.com> wrote:
> Thanks, Alexander.
>
> How would you like us to edit the source? I’ve never worked with
> collaborative editing using gists. Does it work like source editing where I
> fork it and send you a pull request?

I was wrong thinking that working with gist is good way to go since
there is not support for PR and I'll have to merge all the forks
manually:
http://stackoverflow.com/questions/8758612/can-i-make-a-pull-request-on-a-gist-on-github

so I've created regular git repository, so feel free to send PRs!
https://github.com/kxepal/couchdb-replication-protocol

> In general it needs a lot of proofreading for English grammar.

That's the thing I'm bad in ): Would be glad for help with it. If you
in doubt what I have trying to say, I could try rephrase it, but
probably with no better grammar ):

> In general I have a feeling this whole process is a case of “the best being
> the enemy of the good”. Something short like my original document would have
> been fine, IMHO. It’s nice that you put so much effort into this, but it’s
> now a very long document that’s going to take a long time to review.

You're correct, but I found short version isn't much helpful for
creating own peer. That's how and why I had started this document (:
Yes, some main points are explained, but devil in details and I have
to switch to wireshark and read couchdb logs with couch_replicator
source code to understand why replication crushes and what peer should
return on CouchDB request. On some point, that was reimplementing a
part of CouchDB HTTP API, so I think it's possible to replace this
document examples and text stuff with references to API endpoints
definition. However, actually you don't have to implement whole API
for GET /db/docid resource to have replication works, just some small
part, so such references may only confuse more.

>  If you add section numbers, it would be easier to discuss the document
> since we could refer to a specific section unambiguously.

Good idea. Have update HTML version.

> Diagrams: Will these be redone as graphics? The ASCII art is kind of hard to
> read.

Actually, the result graphviz charts are more harder to read since
they are become more unstructured.

> Sample REST request/responses: These have some very long lines that aren’t
> wrapped in the HTML (at least not in my browser, Safari 7.) That makes them
> hard to read.

This is true for `Fetch Changed Documents` section examples. Fixed.

> In the “Verify Peers” section:
> You’re making the assumption that the local database is accessible via HTTP
> using the CouchDB-compatible REST API, which isn’t necessarily true. It’s
> only true of Couchbase Lite if optional components are installed. I don’t
> know if it’s true of PouchDB. In general, an embedded database will be using
> a more direct native API to provide access to local data.

I know, but I made assumption that working with local database using
internal API is an protocol optimization, not general behaviour. If
you set local database name in url form CouchDB replicator will work
with it via HTTP API. I think it's too oblivious to not make HTTP
requests when it's possible to and optimization tips are suggest that.

> In the “Get Peers Information” section:
> It’s not explained why the “instance_start_time” field is needed or what
> it’s used for. (I am still not sure why it’s needed; I just found by trial
> and error that CouchDB expected it.)

Same for me (: Especially having this field mandatory for
/db/_ensure_full_commit response confuses a bit.

> The description of what “update_seq” is for is vague; I couldn’t figure out
> what you meant. Also, its value will not necessarily be a number — in
> BigCouch or Cloudant it’s an opaque string, I think.

Good point, will fix this.

> In the “Retrieve Replication Logs” section:
> The format of these logs is implementation-specific, since the only thing
> that ever reads a log is the replicator that created it. You’ve described
> the format CouchDB uses, but for instance the data stored by Couchbase Lite
> is entirely different.

I was think the same. However, there are some fields in stored
document that replicator expected to see - they are marked as
Required. Others are oblivious optional. Do you think it's better to
remove them to not confuse with such details much?

> I think it’s still useful to show what CouchDB stores, but you don’t need to
> specify it in such detail, and it should be prefaced as being only one
> possible way to store the data.

Agreed.

> “Listen Changes Feed”:
> • The ‘feed’ parameter does not have to be ‘continuous’. This parameter has
> nothing to do with continuous replication; it’s just a similar name. For
> instance, Couchbase Lite only uses ‘longpoll’ since the continuous feed
> doesn’t work reliably with telco proxy servers.

I think I'm starting to understand one more important thing. This
document is about CouchDB replication protocol and how
couch_replicator uses HTTP API during replication process. However,
various replicator implementations may use it in different way as
original one, but this will force peers to implement CouchDB HTTP API
completely or continuously add missed bits to support replication with
Couchbase, PouchDB, Cloudant etc.
Not sure for now how to handle such situation correctly.

> • Does the CouchDB replicator really use a heartbeat of 10 seconds?! That
> seems really short, and will generate a lot of traffic for a server with a
> large number of replication clients. (10000 mobile clients connected = 1000
> heartbeats sent per second. Yikes!)

Good find! I miss this thing to correct. Actual heartbeat value is
connection_timeout divided by 3, but not less than 1000 (0.1 second)
https://github.com/apache/couchdb/blob/master/src/couch_replicator/src/couch_replicator_api_wrap.erl#L376
so it's configurable

> • The ‘since’ parameter shouldn’t be set to 0. If you don’t have a value,
> just omit it. Again, not all servers use numeric values.

Ok, will fix this with adoption for BigCouch / Cloudant.

> • "Reading whole feed with single shot may be not resource optimal solution
> and it is RECOMMENDED to process feed by chunks.” This is an implementation
> specific detail; it might be true of CouchDB but isn’t of other
> implementations. I don’t think it’s appropriate for this document. If you’re
> talking about chunks for the _revs_diff request that follows, that’s
> independent of the changes feed.

No, this is about changes feed. May be "chunks" word is wrong a bit. I
mean there, for instance, if changes feed contains 100500 "events",
it's recommended to process them by portions, say, of 1000 per
iteration. While it's too oblivious solution for such case, it also
clearly defines that there loop is going and you don't have to work
with all response data at once.

> …ok, that’s as much as I have time for today!

and again, thank you very much!(:

Mime
View raw message