Date: Thu, 3 Mar 2011 03:07:37 +0000 (UTC)
From: "Randall Leeds (JIRA)"
To: dev@couchdb.apache.org
Message-ID: <1840064683.10030.1299121657302.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <434228970.7294.1299047016888.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] Updated: (COUCHDB-1080) fail fast with checkpoint conflicts

     [ https://issues.apache.org/jira/browse/COUCHDB-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds updated COUCHDB-1080:
-----------------------------------

    Attachment: paranoid_checkpoint_failure_v2.patch

Thanks for the feedback, Filipe.

In this version I clarified the new log message so that it also suggests why the conflict occurred.
The failure reason, along with the replication info, now gets logged in one place, in terminate/2. It should be easy for users to match up the first error (with its suggestion about what to fix) with the replication that failed for the same reason (logged in terminate/2).
No suggestion is made at error time that the replication might restart. When the supervisor restarts the dead replication, it logs that at INFO (couch_replicator.erl#L127).

How does this look to you?

> fail fast with checkpoint conflicts
> -----------------------------------
>
>                 Key: COUCHDB-1080
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1080
>             Project: CouchDB
>          Issue Type: Improvement
>          Components: Replication
>    Affects Versions: 1.0.2
>            Reporter: Randall Leeds
>             Fix For: 1.1, 1.2
>
>         Attachments: paranoid_checkpoint_failure.patch, paranoid_checkpoint_failure_v2.patch
>
>
> I've thought about this long and hard and probably should have submitted the bug a long time ago. I've also run this in production for months.
> When a checkpoint conflict occurs, aborting is almost always the right thing to do.
> If there is a rev mismatch, it could mean there are two conflicting (continuous and one-shot) replications running between the same hosts. Without reloading the history documents, checkpoints will continue to fail forever. This could leave us in a state with many replicated changes but no checkpoints.
> Similarly, a successful checkpoint followed by a lost or timed-out response could cause this situation.
> Since the supervisor will restart the replication anyway, I think it's safer to abort and retry.
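
For readers who want the idea in miniature, the fail-fast behaviour described in the issue can be sketched roughly as follows. This is an illustration in Python, not the Erlang code from the patch: CheckpointStore, CheckpointConflict, replicate, supervise and competing_writer are all hypothetical names, and the integer rev is a stand-in for a document revision. The point it shows is that a replication which aborts on a rev mismatch reloads the checkpoint when it is restarted, so later checkpoints can succeed again.

```python
class CheckpointConflict(Exception):
    """Raised when the stored checkpoint rev differs from the rev we hold."""


class CheckpointStore:
    """Hypothetical stand-in for a replication history (checkpoint) document."""

    def __init__(self):
        self.rev = 0  # revision of the checkpoint document
        self.seq = 0  # last checkpointed sequence number

    def load(self):
        return self.rev, self.seq

    def save(self, expected_rev, seq):
        # Reject the write if someone else updated the document in between.
        if expected_rev != self.rev:
            raise CheckpointConflict(
                f"expected rev {expected_rev}, found {self.rev}")
        self.rev += 1
        self.seq = seq
        return self.rev


def replicate(store, changes, batch_size=2, interfere=None):
    """One replication attempt; aborts on the first checkpoint conflict."""
    rev, seq = store.load()  # the history is reloaded on every (re)start
    while seq < len(changes):
        seq = min(seq + batch_size, len(changes))
        if interfere is not None:
            interfere(store)  # simulate another writer touching the checkpoint
        rev = store.save(rev, seq)  # fail fast: the exception ends this attempt
    return seq


def supervise(store, changes, max_restarts=3, interfere=None):
    """Restart the replication after a conflict, as a supervisor would."""
    restarts = 0
    while True:
        try:
            return replicate(store, changes, interfere=interfere), restarts
        except CheckpointConflict:
            restarts += 1
            if restarts > max_restarts:
                raise


# Usage: a second replication bumps the checkpoint rev exactly once.
fired = []

def competing_writer(store):
    if not fired:
        fired.append(True)
        store.rev += 1

store = CheckpointStore()
seq, restarts = supervise(store, list(range(5)), interfere=competing_writer)
print(seq, restarts)
```

Had replicate kept going with its stale rev instead of raising, every later save would have failed as well, which is the "many replicated changes but no checkpoints" state the issue describes.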