couchdb-dev mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: fault tolerance in the new replicator
Date Sun, 08 Mar 2009 06:30:18 GMT
I reconfigured things so that checkpoints are also saved every time  
the replicator flushes its internal document buffer.  Looks good so  
far.  Will try to get it committed to SVN tomorrow.  Best,

Adam
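
A minimal sketch of the checkpoint-on-flush idea, written in Python rather than Erlang purely for illustration. The names `BufferedReplicator`, `write_docs`, and `save_checkpoint`, and the default buffer size of 100, are assumptions made for this example and are not taken from the replicator code.

```python
from typing import Callable, List, Tuple

Doc = Tuple[int, dict]  # (source update sequence, document body)


class BufferedReplicator:
    """Hypothetical document buffer that records a checkpoint after each flush."""

    def __init__(self,
                 write_docs: Callable[[List[dict]], None],
                 save_checkpoint: Callable[[int], None],
                 buffer_size: int = 100) -> None:  # size is an assumption
        self.write_docs = write_docs
        self.save_checkpoint = save_checkpoint
        self.buffer_size = buffer_size
        self.buffer: List[Doc] = []

    def add(self, seq: int, doc: dict) -> None:
        """Queue one document and flush once the buffer is full."""
        self.buffer.append((seq, doc))
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self) -> None:
        """Write the buffered documents, then checkpoint the last sequence."""
        if not self.buffer:
            return
        self.write_docs([doc for _, doc in self.buffer])
        # Checkpoint only after the write succeeded, so a crash never
        # records a sequence whose documents were not written.
        self.save_checkpoint(self.buffer[-1][0])
        self.buffer.clear()
```

With this shape, a failure between flushes costs at most one buffer's worth of repeated work on the next run.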

On Mar 7, 2009, at 10:28 PM, Adam Kocoloski wrote:

> And here I thought I was done with replication work for a while ...
>
> When the new replicator streams an attachment, it uses ibrowse  
> without trying to do any error handling.  If the request fails, it  
> kills the whole replication without giving us a chance to checkpoint  
> anything.  That may not be such a great idea; transient network  
> failures are a fact of life.  A possible solution is to have the  
> replicator trap exits.  When an attachment request fails, the  
> replicator can catch the exit, roll back and retry.
>
> I committed some updates to my github branch.  I haven't had a  
> chance to do extensive testing yet, but the main ideas are
>
> * replicator traps exits.  Any linked process that dies (usually a  
> streaming attachment loop) causes the replicator to respawn the  
> document enumerator, which has the effect of redoing the replication  
> for the last < 100 updates.  There's no limit to the number of times  
> this loop can occur, but I think that's OK because ...
>
> * document requests are still made by the gen_server process  
> itself.  We had our own manual retry framework for these; that  
> framework is still in place.  After 10 failed attempts for a  
> particular document, the replicator will terminate with  
> http_request_failed.
>
> * If an abnormal termination occurs, the replicator will try to save  
> the current status in the _local docs on source and target.  If it's  
> successful, the next replicator can pick up where this one left off.
>
> * I also tried to clean up the error messages a bit, returning  
> {"error":"http_request_failed", "reason":Url} instead of dumping the  
> first line of the Erlang traceback on the client.
>
> Hopefully I can commit this tomorrow after some further testing.   
> Cheers, Adam
>
> http://github.com/kocolosk/couchdb/tree/otpify-replication
>
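
The retry behaviour in the quoted list can be sketched roughly as follows. This is a Python analogue, not the Erlang implementation: `ConnectionError` stands in for the exit of a linked streaming process, and `fetch_with_retry`, `replicate`, `error_body`, and their parameters are hypothetical names. For simplicity the sketch tracks progress per update instead of rolling back to the last flushed buffer.

```python
from typing import Callable, Iterable


class HttpRequestFailed(Exception):
    """Stand-in for the http_request_failed termination reason."""


def fetch_with_retry(fetch: Callable[[str], dict], url: str,
                     max_attempts: int = 10) -> dict:
    """Retry one document request; give up after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == max_attempts:
                raise HttpRequestFailed(url)
    raise ValueError("max_attempts must be at least 1")


def replicate(enumerate_since: Callable[[int], Iterable[int]],
              copy_update: Callable[[int], None],
              save_checkpoint: Callable[[int], None],
              start_seq: int = 0) -> int:
    """Copy updates, restarting enumeration after transient failures."""
    done = start_seq
    while True:
        try:
            for seq in enumerate_since(done):
                copy_update(seq)
                done = seq
            break
        except ConnectionError:
            # Analogue of trapping an exit: restart the enumerator from
            # the last completed sequence. This loop has no retry limit.
            continue
        except Exception:
            # Abnormal termination: try to save progress, then give up.
            try:
                save_checkpoint(done)
            except Exception:
                pass
            raise
    save_checkpoint(done)
    return done


def error_body(exc: HttpRequestFailed) -> dict:
    """Build the client-facing error instead of exposing a traceback."""
    return {"error": "http_request_failed", "reason": str(exc)}
```

The outer loop can afford to be unbounded because the only failure that repeats indefinitely, a document that never downloads, is capped by the per-document limit and ends the run.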

