couchdb-dev mailing list archives

From "Filipe Manana (JIRA)" <>
Subject [jira] Commented: (COUCHDB-1080) fail fast with checkpoint conflicts
Date Wed, 02 Mar 2011 11:21:36 GMT


Filipe Manana commented on COUCHDB-1080:

Hi Randall, thanks for the patch. My comments:

1) I would remove the log message 'rebooting ...'. We don't know whether the replication is
actually going to be restarted, since this might be the last attempt made by the supervisor;

2) The log message this patch adds, "checkpoint failure: a database was restarted.", is confusing
in my opinion. What does "database was restarted" mean? I would change the message to something
like "checkpoint commit failure", because what really happened in this case was that _ensure_full_commit
failed for the source or the target. I would also not remove the part "`~s` (`~s` -> `~s`)"
(replication id plus source and target information) from the log message, since it tells the
user exactly which replication is in trouble.
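
As a rough illustration of the message shape suggested in point 2, here is a minimal sketch. It is written in Python purely for readability (CouchDB itself is Erlang), and the function name, argument names, and exact wording are hypothetical, not taken from the patch:

```python
def format_checkpoint_failure(rep_id: str, source: str, target: str) -> str:
    """Build a log line that names the failure and identifies the replication.

    Hypothetical helper: mirrors the "`~s` (`~s` -> `~s`)" part of the
    existing message and swaps in the proposed "checkpoint commit failure" text.
    """
    return f"Replication `{rep_id}` (`{source}` -> `{target}`): checkpoint commit failure"


if __name__ == "__main__":
    # Example values are made up for illustration only.
    print(format_checkpoint_failure("abc123", "http://a/db", "http://b/db"))
```

The point is only that the failure text changes while the replication id, source, and target stay in the message.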

The rest seems fine to me.

> fail fast with checkpoint conflicts
> -----------------------------------
>                 Key: COUCHDB-1080
>                 URL:
>             Project: CouchDB
>          Issue Type: Improvement
>          Components: Replication
>    Affects Versions: 1.0.2
>            Reporter: Randall Leeds
>             Fix For: 1.1, 1.2
>         Attachments: paranoid_checkpoint_failure.patch
> I've thought about this long and hard and probably should have submitted the bug a long
> time ago. I've also run this in production for months.
> When a checkpoint conflict occurs, aborting is almost always the right thing to do.
> If there is a rev mismatch, it could mean there are two conflicting (continuous and
> one-shot) replications running between the same hosts. Without reloading the history documents,
> checkpoints will continue to fail forever. This could leave us in a state with many replicated
> changes but no checkpoints.
> Similarly, a successful checkpoint but a lost/timed-out response could cause this situation.
> Since the supervisor will restart the replication anyway, I think it's safer to abort
> and retry.
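
The abort-and-retry reasoning in the quoted description can be sketched as follows. This is a simplified model in Python, not the CouchDB implementation: `CheckpointStore`, `Replication`, and `CheckpointConflict` are hypothetical names, and an in-memory counter stands in for the revision of the replication history document.

```python
class CheckpointConflict(Exception):
    """Raised when the stored checkpoint revision is not the one we hold."""


class CheckpointStore:
    """In-memory stand-in for a replication history document (assumption)."""

    def __init__(self):
        self.rev = 0
        self.seq = 0

    def write(self, expected_rev, seq):
        # Reject the write if someone else has checkpointed since we loaded.
        if expected_rev != self.rev:
            raise CheckpointConflict(
                f"expected rev {expected_rev}, found {self.rev}"
            )
        self.rev += 1
        self.seq = seq
        return self.rev


class Replication:
    def __init__(self, store):
        self.store = store
        # The history is loaded once, at start-up. A restart is what refreshes it.
        self.rev = store.rev
        self.seq = store.seq

    def checkpoint(self, seq):
        # Fail fast: the conflict is not caught here, so it propagates to
        # whoever supervises this replication.
        self.rev = self.store.write(self.rev, seq)
        self.seq = seq


if __name__ == "__main__":
    store = CheckpointStore()
    first = Replication(store)
    second = Replication(store)   # e.g. a one-shot next to a continuous one

    first.checkpoint(5)           # succeeds and bumps the stored rev
    try:
        second.checkpoint(7)      # holds a stale rev, so this conflicts
    except CheckpointConflict as err:
        print("aborting:", err)
        second = Replication(store)   # a restart reloads the history
        second.checkpoint(7)          # now succeeds
```

In this model, a replication that swallowed the conflict and kept going would retry with the same stale revision on every checkpoint, which is the "fail forever" case the description warns about.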

