incubator-couchdb-user mailing list archives

From: Adam Kocoloski <kocol...@apache.org>
Subject: Re: Replication hangs
Date: Wed, 21 Oct 2009 13:08:56 GMT
On Oct 21, 2009, at 4:23 AM, Simon Eisenmann wrote:

> Hi,
>
> On Monday, 19.10.2009, at 10:04 -0400, Paul Davis wrote:
>>>> Also, you might try setting up the continuous replication instead of
>>>> the update notifications as that might be a bit more ironed out.
>>>
>>> I have already considered that, though as long as there is no way to
>>> figure out whether a continuous replication is still up and running I
>>> cannot use it, because I have to restart it when a node fails and
>>> comes up again later.
>>
>> Hmm. Doesn't the _local doc for the continuous replication show if it's
>> still in progress? Oh, though it might not have a specific flag
>> indicating as such.
>
> I changed the system to use continuous replication and to check the
> _local doc to make sure it's still running. That way everything works
> fine and I cannot reproduce any hangs.
>
> Though in the logs I now see lots of these messages:
>
> [info] [<0.164.0>] A server has restarted sinced replication start. Not
> recording the new sequence number to ensure the replication is redone
> and documents reexamined.
>
> I posted this on IRC yesterday and was told that it is nothing to worry
> about. So what exactly does it mean, and why is it logged at info level
> when it can be ignored?
>
> If that message is nothing critical, I would suggest logging it at
> debug level, as it shows up at every replication checkpoint on every
> node as soon as one of the other nodes has been offline.
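For what it's worth, checking the checkpoint doc the way Simon describes
can be done directly over the HTTP API. Here is a minimal sketch in
Python, where the host, database, and especially the _local doc id are
placeholders (the real checkpoint id is an opaque hash the replicator
computes), and which assumes the checkpoint doc carries a
source_last_seq field:

    import json
    import urllib.error
    import urllib.request

    # All names below are placeholders; the real _local doc id is an
    # opaque hash computed by the replicator.
    CHECKPOINT_URL = ("http://localhost:5984/target_db/"
                      "_local/0a81b6454979e6f7fcd7588a2ca46828f")

    def replication_checkpoint():
        """Return the checkpoint doc, or None if none exists yet (404)."""
        try:
            with urllib.request.urlopen(CHECKPOINT_URL) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return None
            raise

    doc = replication_checkpoint()
    if doc is None:
        print("no checkpoint recorded -- replication may need a restart")
    else:
        print("last recorded source sequence:", doc.get("source_last_seq"))

A 404 on that doc simply means no checkpoint has been recorded yet, so a
monitor treating "missing or stale" as "restart the replication" covers
both cases.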

So, what we're trying to do here is avoid skipping updates from the  
source server.  Consider the following sequence of events:

1) Save some docs to the source with delayed_commits=true
2) Replicate source -> target
3) Restart the source before a full commit, losing updates that have
   already replicated
4) Save more docs to the source, reusing the previously assigned
   sequence numbers
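To make the hazard concrete, here's a toy simulation in Python (not
CouchDB code, just the bookkeeping) showing why a naive resume from the
old checkpoint would skip the re-used sequence numbers:

    # Step 1: docs saved on the source; seq numbers 1..3 assigned in memory.
    source = {1: "doc-a", 2: "doc-b", 3: "doc-c"}

    # Step 2: replicate everything and checkpoint the highest seq we saw.
    target = dict(source)
    checkpoint = max(source)          # checkpoint = 3

    # Step 3: source restarts before a full commit; only seq 1 hit disk.
    source = {1: "doc-a"}

    # Step 4: new docs are saved, re-using the lost sequence numbers 2, 3.
    source[2] = "doc-d"
    source[3] = "doc-e"

    # Resuming from the old checkpoint would skip the new docs entirely:
    missed = {seq: doc for seq, doc in source.items()
              if seq <= checkpoint and target.get(seq) != doc}
    print("docs a naive resume would skip:", missed)
    # -> {2: 'doc-d', 3: 'doc-e'}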

If that happens, we don't want the replicator to skip the new docs that
have been saved in step 4. So if we detect that a server restarted, we
play it safe and don't checkpoint, so that the next replication will
re-examine the sequence. An analogous situation could happen with the
target losing updates that the replicator had written (but not fully
committed).
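The detection itself can be as simple as comparing the
instance_start_time reported in the database info before and after. A
rough sketch against the HTTP API, with the URL and database name as
placeholders:

    import json
    import urllib.request

    DB_URL = "http://localhost:5984/source_db"   # placeholder

    def instance_start_time(db_url):
        with urllib.request.urlopen(db_url) as resp:
            return json.load(resp)["instance_start_time"]

    at_rep_start = instance_start_time(DB_URL)
    # ... replication proceeds, docs are copied ...
    if instance_start_time(DB_URL) != at_rep_start:
        # The server restarted; its committed state may be behind what
        # we already replicated, so recording a checkpoint is unsafe.
        print("server restarted since replication start; not checkpointing")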

Skipping checkpointing altogether for the remainder of the replication
is an overly conservative position. In my opinion, what we should do
when we detect this condition is restart the replication immediately
from the last known checkpoint. Then you'd see one of these [info] level
messages telling you that the replicator is going to restart to
double-check some sequence numbers, and that's it.
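In schematic form, with toy stand-ins and hypothetical names (none of
this is actual CouchDB code), the proposed control flow would look
something like:

    class ToySource:
        def __init__(self):
            self.start_time = 100   # would change if the server restarted
            self.update_seq = 6

    source = ToySource()

    def replicate(checkpoint):
        seen_start = source.start_time
        seq = checkpoint
        while seq < source.update_seq:
            seq += 1
            # ... fetch the changes for `seq`, write them to the target ...
            if source.start_time != seen_start:
                # Restart detected: log once at [info] and replay from the
                # last trusted checkpoint so the affected sequences get
                # re-examined, instead of silently skipping all further
                # checkpoints.
                print("[info] restart detected; replaying from", checkpoint)
                return replicate(checkpoint)
            checkpoint = seq        # safe to record a new checkpoint here
        return checkpoint

    print("replication caught up at seq", replicate(0))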

Best, Adam
