incubator-couchdb-user mailing list archives

From: Matt Goodall <matt.good...@gmail.com>
Subject: Re: Incremental replication over unreliable link -- how granular is replication restart
Date: Thu, 14 May 2009 22:26:18 GMT
2009/5/14 Adam Kocoloski <adam.kocoloski@gmail.com>:
> Hi Matt, going to snip a bit here to keep the discussion manageable ...
>
> On May 14, 2009, at 12:00 PM, Matt Goodall wrote:
>
>> When I tried things before writing my mail I was using two couchdb
>> servers running from relatively recent versions of trunk. So 0.9 and a
>> bit ;-).
>>
>> I didn't know about the ~10MB. I don't know if I reached that
>> threshold which may be why it seemed to be started over each time.
>> I'll try to retest with a lower threshold and more debugging to see
>> what's really happening. Any help on where that hard-coded 10MB value
>> is would be very helpful!
>
> In line 205 of couch_rep.erl you should see

:) Thanks, saved me a bit of time hunting for it.

>
>>    {NewBuffer, NewContext} = case couch_util:should_flush() of
>
> should_flush() takes an argument which is a number of bytes.  So changing
> that to
>
>>    {NewBuffer, NewContext} = case couch_util:should_flush(1000) of
>
> would cause the replicator to checkpoint after each kilobyte (on document
> boundaries, of course).  You should see a line in the logfile on the machine
> initiating the replication like

Yep, reducing the value made the checkpoints happen frequently enough
to prove that you are absolutely correct: replication does happen in
batches and resumes from the last checkpointed batch. Hurray!
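
In case anyone else wants to reproduce this: a resumed replication is
just the same _replicate POST issued again after the interruption, and
it carries on from the last recorded checkpoint instead of from
update_seq 0. The database names below are only placeholders for my
setup:

    curl -X POST http://127.0.0.1:5984/_replicate \
         -H 'Content-Type: application/json' \
         -d '{"source": "somedb", "target": "http://localhost:6984/somedb"}'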

>
> "recording a checkpoint at source update_seq N"
>
>>> Others have commented that the 10MB threshold really needs to be
>>> configurable.  E.g., set it to zero and you get per-document checkpoints,
>>> but your throughput will drop and the final DB size on the target will
>>> grow.
>>>  Super easy to do, but no one's gotten around to it.
>>
>> Presumably the threshold all depends on the quality of the network
>> connection between the two endpoints, although having the default
>> configurable is probably a good thing anyway.
>
> I think a configurable default is an OK option, but what I'd really like to
> see is the checkpoint threshold added as an optional field to the JSON body
> sent in an individual POST to _replicate.

Yep, exactly. I seem to have deleted the bit I had written about a
per-replication option, although the implication was still there.

Perhaps the default should be configurable and based on both data size
*and* time, whichever is reached first? That way fast connections will
checkpoint fewer times, keeping the target DB size down, while slow
connections will checkpoint more often and so always have a recent
checkpoint to resume from.
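
Just to make that concrete, the _replicate body might grow a couple of
extra fields along these lines. Note that "checkpoint_size" and
"checkpoint_interval" are made-up names purely for illustration;
nothing in couch_rep.erl understands them today:

    curl -X POST http://127.0.0.1:5984/_replicate \
         -H 'Content-Type: application/json' \
         -d '{"source": "somedb",
              "target": "http://localhost:6984/somedb",
              "checkpoint_size": 1048576,
              "checkpoint_interval": 60}'

i.e. checkpoint after roughly 1MB of transferred documents or after 60
seconds, whichever comes first.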


>
>>>> Secondly, if the network connection fails in the middle of replication
>>>> (closing an ssh tunnel is a good way to test this ;-)) then it seems
>>>> to retry a few (10) times before the replicator process terminates. If
>>>> the network connection becomes available again (restart the ssh
>>>> tunnel) the replicator doesn't seem to notice. Also, I just noticed
>>>> that Futon still lists the replication on its status page.
>>>
>>> That's correct, the replicator does try to ignore transient failures.
>>
>> Hmm, it seemed to fail on transient failures here. After killing and
>> restarting my ssh tunnel I left the replication a while and it never
>> seemed to continue, and the only way to clear it from the status list
>> was to restart the couchdb server. I'll check again though.
>
> Ok, I misread you earlier.  It's possible that CouchDB or ibrowse is trying
> to reuse a socket when it really should be opening a new one.  That would be
> a bug.

This one definitely seems like a bug. Killing and restarting my SSH
tunnel basically kills the replication; I can see no sign of it
resuming.

You get this in the log ...

[error] [<0.62.0>] replicator terminating with reason {http_request_failed,
                                       [104,116,116,112,58,47,47,108,111,99,
                                        97,108,104,111,115,116,58,54,57,56,52,
                                        47,99,117,114,114,101,110,116,97,103,
                                        114,101,101,109,101,110,116,115,47,54,
                                        102,56,49,54,49,48,49,100,49,52,97,52,
                                        49,50,101,98,101,52,48,56,57,52,100,
                                        50,101,100,51,50,50,101,100,63,114,
                                        101,118,115,61,116,114,117,101,38,108,
                                        97,116,101,115,116,61,116,114,117,101,
                                        38,111,112,101,110,95,114,101,118,115,
                                        61,91,34,<<"1-1039057390">>,34,93]}
[info] [<0.62.0>] recording a checkpoint at source update_seq 213
[error] [<0.53.0>] Uncaught error in HTTP request: {exit,normal}
[info] [<0.61.0>] 127.0.0.1 - - 'GET' /_active_tasks 200
[info] [<0.362.0>] retrying couch_rep HTTP post request due to {error,
conn_failed}: http://localhost:6984/currentagreements/_ensure_full_commit
[info] [<0.362.0>] retrying couch_rep HTTP post request due to {error,
conn_failed}: http://localhost:6984/currentagreements/_ensure_full_commit
[info] [<0.362.0>] retrying couch_rep HTTP post request due to {error,
conn_failed}: http://localhost:6984/currentagreements/_ensure_full_commit
[info] [<0.362.0>] retrying couch_rep HTTP post request due to {error,
conn_failed}: http://localhost:6984/currentagreements/_ensure_full_commit
[info] [<0.362.0>] retrying couch_rep HTTP post request due to {error,
conn_failed}: http://localhost:6984/currentagreements/_ensure_full_commit
[info] [<0.362.0>] retrying couch_rep HTTP post request due to {error,
conn_failed}: http://localhost:6984/currentagreements/_ensure_full_commit
[info] [<0.362.0>] retrying couch_rep HTTP post request due to {error,
conn_failed}: http://localhost:6984/currentagreements/_ensure_full_commit
[info] [<0.362.0>] retrying couch_rep HTTP post request due to {error,
conn_failed}: http://localhost:6984/currentagreements/_ensure_full_commit
[info] [<0.362.0>] retrying couch_rep HTTP post request due to {error,
conn_failed}: http://localhost:6984/currentagreements/_ensure_full_commit
[info] [<0.362.0>] retrying couch_rep HTTP post request due to {error,
conn_failed}: http://localhost:6984/currentagreements/_ensure_full_commit
[error] [<0.362.0>] couch_rep HTTP post request failed after 10
retries: http://localhost:6984/currentagreements/_ensure_full_commit

and then nothing.

Worst of all, couch still thinks the replication is running and
refuses to start another one. Currently, the only way to clear it is
to restart the couch server :-/.
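
For what it's worth, the stale entry also shows up in the status
resource itself, which is presumably what Futon's status page is
polling given the GET /_active_tasks line in the log above (adjust the
port if you are not on the default 5984):

    curl http://127.0.0.1:5984/_active_tasks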

>
>>>> If I'm correct, and I really hope I'm missing something, then
>>>> couchdb's replication is probably not currently suitable for
>>>> replicating anything but very small database differences over an
>>>> unstable connection. Does anyone have any real experience in this sort
>>>> of scenario?
>>>
>>> Personally, I do not.  I think the conclusion is a bit pessimistic,
>>> though.
>>
>> Sorry, wasn't meaning to be pessimistic. Just trying to report
>> honestly what I was seeing so it could be improved where possible.
>
> Absolutely, that statement probably came off too confrontational.  The more
> high-quality feedback like this we get the better off we'll be!  Cheers,
>
> Adam
>
