couchdb-user mailing list archives

From Adam Kocoloski
Subject Re: Incremental replication over unreliable link -- how granular is replication restart
Date Thu, 14 May 2009 16:17:47 GMT
Hi Matt, going to snip a bit here to keep the discussion manageable ...

On May 14, 2009, at 12:00 PM, Matt Goodall wrote:

> When I tried things before writing my mail I was using two couchdb
> servers running from relatively recent versions of trunk. So 0.9 and a
> bit ;-).
> I didn't know about the ~10MB. I don't know if I reached that
> threshold which may be why it seemed to be started over each time.
> I'll try to retest with a lower threshold and more debugging to see
> what's really happening. Any help on where that hard-coded 10MB value
> is would be very helpful!

In line 205 of couch_rep.erl you should see

>     {NewBuffer, NewContext} = case couch_util:should_flush() of

should_flush() also accepts an argument, a threshold in bytes.  So
changing that line to

>     {NewBuffer, NewContext} = case couch_util:should_flush(1000) of

would cause the replicator to checkpoint roughly every kilobyte
(always on a document boundary, of course).  Each checkpoint should
show up in the logfile on the machine initiating the replication as a
line like

"recording a checkpoint at source update_seq N"

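To illustrate what a byte threshold on document boundaries means,
here is a minimal sketch.  It is not the code in couch_rep.erl: the
module and function names are hypothetical, and it simply counts
document bytes, which is a simplification of whatever should_flush()
actually measures.

```erlang
-module(checkpoint_sketch).
-export([checkpoints/2]).

%% Hypothetical illustration only, not the CouchDB replicator.
%% Docs is a list of {Seq, Binary} pairs in update_seq order.
%% Returns the sequence numbers at which a checkpoint would be
%% recorded if we checkpoint once the buffered bytes reach Threshold.
checkpoints(Docs, Threshold) ->
    checkpoints(Docs, Threshold, 0, []).

checkpoints([], _Threshold, _Buffered, Acc) ->
    lists:reverse(Acc);
checkpoints([{Seq, Doc} | Rest], Threshold, Buffered, Acc) ->
    NewBuffered = Buffered + byte_size(Doc),
    case NewBuffered >= Threshold of
        true ->
            %% The check runs only after a whole document has been
            %% added, so a checkpoint never lands mid-document.
            checkpoints(Rest, Threshold, 0, [Seq | Acc]);
        false ->
            checkpoints(Rest, Threshold, NewBuffered, Acc)
    end.
```

With three 600-byte documents at sequences 1, 2 and 3 and a threshold
of 1000, the sketch checkpoints once, after sequence 2.  With a
threshold of 0 it checkpoints after every document.
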
>> Others have commented that the 10MB threshold really needs to be
>> configurable.  E.g., set it to zero and you get per-document
>> checkpoints, but your throughput will drop and the final DB size on
>> the target will grow.  Super easy to do, but no one's gotten around
>> to it.
> Presumably the threshold all depends on the quality of the network
> connection between the two endpoints, although having the default
> configurable is probably a good thing anyway.

I think a configurable default is an OK option, but what I'd really  
like to see is the checkpoint threshold added as an optional field to  
the JSON body sent in an individual POST to _replicate.

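A rough sketch of what such a request body might look like follows.
The "checkpoint_threshold" field name is purely hypothetical, since no
such option exists yet, and the module and function names are made up
for the example.  Only "source" and "target" are existing fields.  The
sketch also assumes the source and target strings need no JSON
escaping.

```erlang
-module(replicate_body_sketch).
-export([body/3]).

%% Hypothetical illustration: build the JSON body for a POST to
%% _replicate with an extra, not-yet-existing threshold field.
%% Source and Target are plain strings that need no JSON escaping.
body(Source, Target, ThresholdBytes)
  when is_integer(ThresholdBytes), ThresholdBytes >= 0 ->
    lists:flatten(io_lib:format(
        "{\"source\":\"~s\",\"target\":\"~s\","
        "\"checkpoint_threshold\":~B}",
        [Source, Target, ThresholdBytes])).
```

A per-request field like this would let a client on a flaky link ask
for frequent checkpoints without changing the behaviour for everyone
else on the server.
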
>>> Secondly, if the network connection fails in the middle of
>>> replication (closing an ssh tunnel is a good way to test this ;-))
>>> then it seems to retry a few (10) times before the replicator
>>> process terminates. If the network connection becomes available
>>> again (restart the ssh tunnel) the replicator doesn't seem to
>>> notice. Also, I just noticed that Futon still lists the replication
>>> on its status page.
>> That's correct, the replicator does try to ignore transient failures.
> Hmm, it seemed to fail on transient failures here. After killing and
> restarting my ssh tunnel I left the replication a while and it never
> seemed to continue, and the only way to clear it from the status list
> was to restart the couchdb server. I'll check again though.

Ok, I misread you earlier.  It's possible that CouchDB or ibrowse is  
trying to reuse a socket when it really should be opening a new one.   
That would be a bug.

>>> If I'm correct, and I really hope I'm missing something, then
>>> couchdb's replication is probably not currently suitable for
>>> replicating anything but very small database differences over an
>>> unstable connection. Does anyone have any real experience in this
>>> sort of scenario?
>> Personally, I do not.  I think the conclusion is a bit pessimistic,
>> though.
> Sorry, wasn't meaning to be pessimistic. Just trying to report
> honestly what I was seeing so it could be improved where possible.

Absolutely, and my statement probably came off as too confrontational.
The more high-quality feedback like this we get, the better off we'll
be!

