couchdb-user mailing list archives

From Jan Lehnardt <...@apache.org>
Subject Re: Replication hangs
Date Wed, 21 Oct 2009 13:36:04 GMT

On 21 Oct 2009, at 15:08, Adam Kocoloski wrote:

> On Oct 21, 2009, at 4:23 AM, Simon Eisenmann wrote:
>
>> Hi,
>>
>> On Monday, 19 Oct 2009, at 10:04 -0400, Paul Davis wrote:
>>>>> Also, you might try setting up the continuous replication instead
>>>>> of the update notifications as that might be a bit more ironed out.
>>>>
>>>> I have already considered that, though as long as there is no way to
>>>> figure out if a continuous replication is still up and running I
>>>> cannot use it, because I have to restart it when a node fails and
>>>> comes up again later.
>>>
>>> Hmm. Doesn't the _local doc for the continuous replication show if
>>> it's still in progress? Oh, though it might not have a specific flag
>>> indicating as such.
>>
>> I changed the system to use continuous replication and to check the
>> _local doc to make sure it's still running. That way everything works
>> fine and I cannot reproduce any hangs.
>>
>> Though in the logs I now see lots of
>>
>> [info] [<0.164.0>] A server has restarted sinced replication start.
>> Not recording the new sequence number to ensure the replication is
>> redone and documents reexamined.
>>
>> messages. I posted this in IRC yesterday and was told that this is
>> nothing to worry about. So what exactly does it mean, and why is it
>> logged at info level when it can be ignored?
>>
>> If that message is nothing critical I would suggest logging it at
>> debug level, as it is shown at every replication checkpoint on every
>> node as soon as one of the other nodes has been offline.
>
> So, what we're trying to do here is avoid skipping updates from the
> source server.  Consider the following sequence of events:
>
> 1) Save some docs to the source with delayed_commits=true
> 2) Replicate source -> target
> 3) Restart source before full commit, losing the updates that have
>    been replicated
> 4) Save more docs to source, overwriting previously used sequence
>    numbers
>
> If that happens, we don't want the replicator to skip the new docs
> that have been saved in step 4.  So if we detect that a server
> restarted, we play it safe and don't checkpoint, so that the next
> replication will re-examine the sequence.  An analogous situation
> could happen with the target losing updates that the replicator had
> written (but not fully committed).
>
> Skipping checkpointing altogether for the remainder of the
> replication is an overly conservative position.  In my opinion what
> we should do when we detect this condition is restart the
> replication immediately from the last known checkpoint.  Then you'd
> see one of these [info] level messages telling you that the
> replicator is going to restart to double-check some sequence
> numbers, and that's it.
>
> Best, Adam
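
The sequence of events Adam describes can be illustrated with a toy simulation. This is only a minimal sketch and not CouchDB code: `ToySource`, `copy_changes`, `next_checkpoint`, and the `boot_id` restart counter are hypothetical names invented for the example, and the source is modelled as a plain in-memory list whose positions stand in for sequence numbers.

```python
class ToySource:
    """Hypothetical in-memory source with delayed commits (not CouchDB)."""

    def __init__(self):
        self.committed = []  # docs that survived a full commit
        self.pending = []    # docs accepted but not yet fully committed
        self.boot_id = 1     # changes on every restart

    def save(self, doc):
        self.pending.append(doc)

    def full_commit(self):
        self.committed.extend(self.pending)
        self.pending = []

    def restart(self):
        # Uncommitted docs are lost, so their sequence numbers get reused.
        self.pending = []
        self.boot_id += 1

    def changes_since(self, seq):
        docs = self.committed + self.pending
        return [(i + 1, d) for i, d in enumerate(docs) if i + 1 > seq]


def copy_changes(source, target, checkpoint):
    """Copy every change after `checkpoint`; return the last sequence seen."""
    last_seq = checkpoint
    for seq, doc in source.changes_since(checkpoint):
        target.add(doc)
        last_seq = seq
    return last_seq


def next_checkpoint(old, last_seq, boot_at_start, boot_now, safe):
    """With `safe`, refuse to advance the checkpoint after a restart."""
    if safe and boot_now != boot_at_start:
        return old
    return last_seq


def run_scenario(safe):
    source, target, checkpoint = ToySource(), set(), 0
    boot_at_start = source.boot_id
    source.save("a"); source.save("b")                     # step 1
    last_seq = copy_changes(source, target, checkpoint)    # step 2
    source.restart()                                       # step 3
    source.save("c"); source.save("d")                     # step 4
    source.full_commit()
    checkpoint = next_checkpoint(checkpoint, last_seq,
                                 boot_at_start, source.boot_id, safe)
    copy_changes(source, target, checkpoint)               # next replication
    return target


print(sorted(run_scenario(safe=False)))  # checkpoint advanced: "c", "d" skipped
print(sorted(run_scenario(safe=True)))   # checkpoint withheld: "c", "d" copied
```

In the unsafe run the checkpoint moves to sequence 2, so the docs saved in step 4, which reuse sequences 1 and 2, are never examined. In the safe run the checkpoint stays where it was and the next replication picks them up, at the cost of re-reading the sequence.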

Adam, this mail is great wiki material. Can you (or anyone) find a
place for it on the wiki for future reference?

Cheers
Jan
--

