incubator-couchdb-user mailing list archives

From: Adam Kocoloski <kocol...@apache.org>
Subject: Re: Incremental replication over unreliable link -- how granular is replication restart
Date: Sun, 17 May 2009 17:08:39 GMT
On May 14, 2009, at 6:26 PM, Matt Goodall wrote:

>>>>> Secondly, if the network connection fails in the middle of replication
>>>>> (closing an ssh tunnel is a good way to test this ;-)) then it seems
>>>>> to retry a few (10) times before the replicator process terminates. If
>>>>> the network connection becomes available again (restart the ssh
>>>>> tunnel) the replicator doesn't seem to notice. Also, I just noticed
>>>>> that Futon still lists the replication on its status page.
>>>>
>>>> That's correct, the replicator does try to ignore transient failures.
>>>
>>> Hmm, it seemed to fail on transient failures here. After killing and
>>> restarting my ssh tunnel I left the replication a while and it never
>>> seemed to continue, and the only way to clear it from the status list
>>> was to restart the couchdb server. I'll check again though.
>>
>> Ok, I misread you earlier.  It's possible that CouchDB or ibrowse is
>> trying to reuse a socket when it really should be opening a new one.
>> That would be a bug.
>
> This one definitely seems like a bug. Killing and restarting my SSH
> tunnel basically kills the replication; I can see no sign of it
> resuming.
>
> You get this in the log ...

<snip>

> and then nothing.
>
> Worst of all is that couch still thinks the replication is running and
> refuses to start another one. Currently, the only solution is to
> restart the couch server :-/.

Thanks again for catching this bug, Matt.  The example you showed
occurs when we write a checkpoint record, but there was also a similar
problem with writing attachments to disk.  I've committed a very
simplistic fix: the replicator should now realize that these requests
are never going to complete and commit seppuku.  Not the most elegant
solution, perhaps, but it's certainly better than restarting the
server.  The error message should take one of the following forms
(we're still working on standardizing these error messages, of course):

{"error":"replication_link_failure", "reason":"{gen_server, call ...}"}
{"error":"internal_server_error", "reason":"replication_link_failure"}
{"error":"attachment_request_failed", "reason":"failed to replicate  
http://..."}
{"error":"attachment_request_failed", "reason":"ibrowse error on  
http://... : Reason"}

We'll work on a more fine-grained failure mode in the future.

Best,
Adam
