couchdb-dev mailing list archives

From "Randall Leeds (JIRA)" <j...@apache.org>
Subject [jira] Commented: (COUCHDB-704) Replication can lose checkpoints
Date Wed, 13 Oct 2010 08:04:32 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920469#action_12920469 ]

Randall Leeds commented on COUCHDB-704:
---------------------------------------

Can I get someone to take a quick look at this, please?
I don't see any reason not to commit this to 1.0.x and trunk and close it, and very good reason
to do so.

3 insertions and 1 deletion. Easy review. Get it while it's hot (well, it's already a month
and a half cold)!

Summary:
The bug: each checkpoint updates the replication log by overwriting the last history entry
in place. This is fine except when nasty network errors cause the log to be written to only
one of the two dbs involved. If this occurs, the last history entry will not have a matching
session_id in the other log, and the next replication has to restart from the latest entry
the two logs still share. Imagine 3 months of replication checkpoints lost because a switch
flapped. Ouch.

The change: replication keeps the same session_id across checkpoints. Even if only one log
is written, the last entries will still have a matching session_id, so we can be sure that
the recorded_seq is committed to both. At most one checkpoint is lost.
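
To make the failure mode concrete, here is a rough Python sketch (not the actual Erlang
code from the patch). It assumes each log is a dict with a "history" list, newest entry
first, whose entries carry "session_id" and "recorded_seq". The helper names common_seq,
checkpoint and run are made up for illustration, and taking the lower of the two sequences
on a match is my own conservative assumption rather than something the patch is known to do.

```python
import uuid


def common_seq(source_log, target_log):
    """Newest recorded_seq whose session_id appears in both histories, else 0."""
    target_by_id = {e["session_id"]: e["recorded_seq"] for e in target_log["history"]}
    for entry in source_log["history"]:  # newest entry first
        sid = entry["session_id"]
        if sid in target_by_id:
            # Assumption: only trust the lower of the two recorded sequences.
            return min(entry["recorded_seq"], target_by_id[sid])
    return 0


def checkpoint(original_history, session_id, seq):
    """Build a log whose head replaces any earlier head written by this run."""
    head = {"session_id": session_id, "recorded_seq": seq}
    return {"history": [head] + original_history}


def run(keep_session_id):
    """Simulate two checkpoints where the second write to the target fails."""
    original = [{"session_id": "old-run", "recorded_seq": 100}]
    source = target = {"history": original}
    session_id = uuid.uuid4().hex
    for seq, target_write_ok in [(200, True), (300, False)]:
        if not keep_session_id:
            # Behaviour described as the bug: a fresh session_id per checkpoint.
            session_id = uuid.uuid4().hex
        source = checkpoint(original, session_id, seq)
        if target_write_ok:
            target = checkpoint(original, session_id, seq)
    return common_seq(source, target)
```

With a fresh session_id per checkpoint, run(False) falls all the way back to the pre-run
entry at sequence 100. With the session_id kept, run(True) resumes from 200, so only the
one checkpoint that failed to reach the target is lost.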

> Replication can lose checkpoints
> --------------------------------
>
>                 Key: COUCHDB-704
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-704
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.11.2, 1.0.1
>            Reporter: Randall Leeds
>            Priority: Minor
>         Attachments: keep_session_id.patch, save-all-rep-checkpoints.patch, whitespace.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> When saving replication checkpoints in the _local/<repid> document the new entry
> is always pushed onto the _original_ "history" list property that existed at the start of
> the replication. When any number of things causes the checkpoint to be written to only one
> of the databases the head of the history list gets out of sync. Subsequent attempts to start
> this replication must start from the latest common replication log entry in the _original_
> history, as though this replication never occurred.
> A better idea is to push every checkpoint onto the history instead of replacing the head
> on each save.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

