Date: Wed, 2 Mar 2011 11:21:36 +0000 (UTC)
From: "Filipe Manana (JIRA)"
To: dev@couchdb.apache.org
Reply-To: dev@couchdb.apache.org
Message-ID: <788004842.7678.1299064896923.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <434228970.7294.1299047016888.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] Commented: (COUCHDB-1080) fail fast with checkpoint conflicts

[ https://issues.apache.org/jira/browse/COUCHDB-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001375#comment-13001375 ]

Filipe Manana commented on COUCHDB-1080:
----------------------------------------

Hi Randall,

thanks for the patch. My comments:

1) I would remove the log message 'rebooting ...'. We don't really know if it's going to be rebooted; it might be the last attempt made by the supervisor.

2) The log message this patch adds, "checkpoint failure: a database was restarted.", is confusing in my opinion. What does "database was restarted" mean? I would change the message to something like "checkpoint commit failure", because what really happened in this case was that _ensure_full_commit failed for the source or the target.
I would also not remove the part "`~s` (`~s` -> `~s`)" (replication id plus source and target information) from the log message, since it tells the user exactly which replication is in trouble.

The rest seems fine to me.

cheers

> fail fast with checkpoint conflicts
> -----------------------------------
>
>                 Key: COUCHDB-1080
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1080
>             Project: CouchDB
>          Issue Type: Improvement
>          Components: Replication
>    Affects Versions: 1.0.2
>            Reporter: Randall Leeds
>             Fix For: 1.1, 1.2
>
>         Attachments: paranoid_checkpoint_failure.patch
>
>
> I've thought about this long and hard and probably should have submitted the bug a long time ago. I've also run this in production for months.
> When a checkpoint conflict occurs, aborting is almost always the right thing to do.
> If there is a rev mismatch, it could mean there are two conflicting replications (continuous and one-shot) running between the same hosts. Without reloading the history documents, checkpoints will continue to fail forever. This could leave us in a state with many replicated changes but no checkpoints.
> Similarly, a successful checkpoint but a lost/timed-out response could cause this situation.
> Since the supervisor will restart the replication anyway, I think it's safer to abort and retry.
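
The fail-fast behaviour discussed above can be illustrated with a small simulation. This is only a sketch in Python, not the CouchDB replicator (which is written in Erlang) and not the attached patch: `CheckpointStore`, `CheckpointConflict`, `replicate_once` and `supervise` are hypothetical names invented for the example, and the integer `rev` merely stands in for a history document revision. The log line follows the wording suggested in comment 2, with Erlang's `~s` placeholders rendered as Python's `%s`.

```python
import logging

logger = logging.getLogger("replicator_sketch")


class CheckpointConflict(Exception):
    """The stored checkpoint revision differs from the one this attempt holds."""


class CheckpointStore:
    """Hypothetical stand-in for a replication history document."""

    def __init__(self):
        self.rev = 0   # revision of the history document
        self.seq = 0   # last checkpointed sequence number

    def write(self, expected_rev, seq):
        # Reject the write if someone else updated the document in between.
        if expected_rev != self.rev:
            raise CheckpointConflict(
                "expected rev %d, found %d" % (expected_rev, self.rev))
        self.rev += 1
        self.seq = seq
        return self.rev


def replicate_once(store, rep_id, source, target, batches):
    """One replication attempt: abort on the first checkpoint conflict."""
    rev = store.rev  # the history is reloaded at the start of every attempt
    for seq in batches:
        if seq <= store.seq:
            continue  # this batch is already covered by a checkpoint
        try:
            rev = store.write(rev, seq)
        except CheckpointConflict:
            # Say which replication failed, but do not claim a restart:
            # only the supervisor knows whether another attempt will follow.
            logger.error("checkpoint commit failure: `%s` (`%s` -> `%s`)",
                         rep_id, source, target)
            raise
    return store.seq


def supervise(attempt, max_restarts=3):
    """Restart a failed attempt up to max_restarts times, then give up."""
    for n in range(max_restarts + 1):
        try:
            return attempt()
        except CheckpointConflict:
            if n == max_restarts:
                raise  # last attempt: propagate instead of restarting
```

Because each attempt re-reads `store.rev` before writing, a restarted attempt can checkpoint again, whereas an attempt that kept its stale `rev` would fail on every later write.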