couchdb-user mailing list archives

From Simon Eisenmann <si...@struktur.de>
Subject Re: Replication hangs
Date Wed, 21 Oct 2009 13:15:17 GMT
On Wednesday, 21 Oct 2009, at 09:08 -0400, Adam Kocoloski wrote:
> > Though in the logs I now see lots of
> >
> > [info] [<0.164.0>] A server has restarted sinced replication start.  Not
> > recording the new sequence number to ensure the replication is redone
> > and documents reexamined.
> >
> > messages. I posted this in IRC yesterday and was told that this is
> > nothing to worry about. So what exactly does it mean, and why is it
> > logged at info level when it can be ignored?
> >
> > If that message is nothing critical, I would suggest logging it at
> > debug level, as it is shown at every replication checkpoint on every
> > node as soon as one of the other nodes has been offline.
> 
> So, what we're trying to do here is avoid skipping updates from the
> source server.  Consider the following sequence of events:
>
> 1) Save some docs to the source with delayed_commits=true
> 2) Replicating source -> target
> 3) Restart source before full commit, losing the updates that have
>    replicated
> 4) Save more docs to source, overwriting previously used sequence
>    numbers
>
> If that happens, we don't want the replicator to skip the new docs
> that have been saved in step 4.  So if we detect that a server
> restarted, we play it safe and don't checkpoint, so that the next
> replication will re-examine the sequence.  An analogous situation
> could happen with the target losing updates that the replicator had
> written (but not fully committed).
>
> Skipping checkpointing altogether for the remainder of the replication
> is an overly conservative position.  In my opinion what we should do
> when we detect this condition is restart the replication immediately
> from the last known checkpoint.  Then you'd see one of these [info]
> level messages telling you that the replicator is going to restart to
> double-check some sequence numbers, and that's it.

OK, understood. Thanks for the explanation. If that behaviour only
happened once, I would be absolutely fine with it. But with the current
implementation it goes on forever, and replication never seems to
switch back to normal mode.
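
To make the difference concrete, here is a minimal sketch in Python. It
is a toy model, not CouchDB code: the function run(), the "boot-1" and
"boot-2" tokens and the sequence numbers are all made up for
illustration. It only assumes that the replicator remembers some
identity of the server from when replication started and compares it at
every checkpoint.

```python
# Toy model only; NOT CouchDB code. All names here are hypothetical.

def run(seqs, restart_before, restart_once_fix):
    """Simulate checkpointing over increasing sequence numbers.

    seqs: sequence numbers at which a checkpoint is attempted.
    restart_before: the seq whose checkpoint first sees the restarted server.
    restart_once_fix: False models "skip every later checkpoint";
        True models "log once, then resume from the last checkpoint".
    Returns (last_recorded_checkpoint, log_messages).
    """
    known_start = "boot-1"    # server identity remembered at replication start
    current_start = "boot-1"  # identity the server reports right now
    checkpoint = 0
    log = []
    restarted = False
    i = 0
    while i < len(seqs):
        seq = seqs[i]
        if seq == restart_before and not restarted:
            current_start = "boot-2"  # the server came back as a new instance
            restarted = True
        if current_start != known_start:
            log.append(f"server restarted, not recording seq {seq}")
            if restart_once_fix:
                # Accept the new instance and rewind to just after the
                # last checkpoint that was actually recorded.
                known_start = current_start
                i = next(k for k, s in enumerate(seqs) if s > checkpoint)
                continue
        else:
            checkpoint = seq
        i += 1
    return checkpoint, log


if __name__ == "__main__":
    print(run([10, 20, 30, 40], restart_before=20, restart_once_fix=False))
    print(run([10, 20, 30, 40], restart_before=20, restart_once_fix=True))
```

In this model the first call never gets past checkpoint 10 and logs a
message at every later checkpoint, which matches what I see. The second
call logs once and then checkpoints normally again.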

Best regards
Simon


-- 
Simon Eisenmann

[ mailto:simon@struktur.de ]

[ struktur AG | Kronenstraße 22a | D-70173 Stuttgart ]
[ T. +49.711.896656.68 | F.+49.711.89665610 ]
[ http://www.struktur.de | mailto:info@struktur.de ]
