From: Daniel Gonzalez
Date: Fri, 27 Apr 2012 12:03:28 +0200
Subject: Replications stopping unexpectedly
To: user@couchdb.apache.org

Hello,

I will describe my problem in general terms. If more details are needed, I will try to gather them from my production environments.

We have several CouchDB instances, each with a number of databases. Some of these databases are connected via replication: some replications run through an SSH tunnel, others over a direct internet connection. The latency between CouchDB instances ranges from a few milliseconds to several hundred milliseconds.

My problem is that replications stop very often. Sometimes this is due to lost connectivity (the SSH tunnels occasionally fail and must be recreated), but that is not the only cause. Worse, the replications are not restarted automatically: they stay in the error state.

The problem is so frequent that I run a replication monitor process which, every 5 minutes, looks for replications in error and deletes and recreates their replication documents. This is the only method I have found that reliably restarts the replications.

Is anybody else experiencing similar problems? Do you have any suggestions on how to make replications more robust against connectivity issues? Is there any other way to restart failed replications, apart from redefining them?

Thanks,
Daniel Gonzalez
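For reference, the monitor workaround described in the message (find replications in error, delete their documents, recreate them) could look roughly like the Python sketch below. It is a hypothetical illustration, not the poster's actual script: it only computes the restart plan, and the HTTP calls are left as comments. It assumes the CouchDB 1.x `_replicator` database conventions, where the replicator records progress in a `_replication_state` field ("triggered", "completed" or "error") plus other `_replication_*` fields; the function names are made up for this example.

```python
from typing import Any, Dict, List, Tuple

Doc = Dict[str, Any]


def docs_in_error(docs: List[Doc]) -> List[Doc]:
    # Hypothetical helper: keep replication documents whose (assumed)
    # _replication_state field is "error", skipping design documents.
    # `docs` is assumed to be the "doc" members of
    # GET /_replicator/_all_docs?include_docs=true
    return [
        d for d in docs
        if not d.get("_id", "").startswith("_design/")
        and d.get("_replication_state") == "error"
    ]


def fresh_copy(doc: Doc) -> Doc:
    # Keep _id and the user-supplied fields (source, target, continuous, ...),
    # but drop _rev and the replicator-managed _replication_* fields so the
    # document can be written again as a brand-new document.
    return {
        k: v for k, v in doc.items()
        if k != "_rev" and not k.startswith("_replication_")
    }


def plan_restarts(docs: List[Doc]) -> List[Tuple[str, str, Doc]]:
    # For each errored replication return (id, rev, new_doc). A monitor would
    # then issue DELETE /_replicator/<id>?rev=<rev> followed by
    # PUT /_replicator/<id> with new_doc as the JSON body.
    return [(d["_id"], d["_rev"], fresh_copy(d)) for d in docs_in_error(docs)]


if __name__ == "__main__":
    sample = [
        {"_id": "r1", "_rev": "3-a", "source": "db1",
         "target": "http://remote:5984/db1", "continuous": True,
         "_replication_state": "error", "_replication_id": "abc"},
        {"_id": "r2", "_rev": "2-b", "source": "db2", "target": "db3",
         "_replication_state": "triggered"},
    ]
    for doc_id, rev, new_doc in plan_restarts(sample):
        print(doc_id, rev, new_doc)
```

Separating the planning step from the HTTP calls keeps the selection logic easy to test; the actual DELETE/PUT requests can be issued with any HTTP client.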