From: Adam Kocoloski
To: user@couchdb.apache.org
Subject: Re: Replication hangs
Date: Mon, 19 Oct 2009 10:37:19 -0400

On Oct 19, 2009, at 10:14 AM, Simon Eisenmann wrote:

> On Monday, 19.10.2009, at 10:04 -0400, Adam Kocoloski wrote:
>> On Oct 19, 2009, at 10:00 AM, Simon Eisenmann wrote:
>>
>>> Paul,
>>>
>>> On Monday, 19.10.2009, at 09:53 -0400, Paul Davis wrote:
>>>> Hmmm, that sounds most odd. Are there any consistencies in when it
>>>> hangs? Specifically, does it look like it's a poison doc that
>>>> causes things to go wonky, or some such? Do nodes fail in a
>>>> specific order?
>>>
>>> The only pattern I see is that somehow the slowest node never seems
>>> to fail. The other two nodes have roughly the same performance.
>>>
>>>> Also, you might try setting up continuous replication instead of
>>>> the update notifications, as that might be a bit more ironed out.
>>>
>>> I have already considered that, but as long as there is no way to
>>> find out whether a continuous replication is still up and running I
>>> cannot use it, because I have to restart it when a node fails and
>>> comes back up later.
>>>
>>>> Another thing to check is whether it's just the task status that's
>>>> wonky vs. actual replication.
>>>> You can check the _local doc that's created by replication to see
>>>> whether its update seq is changing while the task statuses aren't.
>>>
>>> If only the status were hanging, I should be able to start the
>>> replication again, correct? But starting it hangs as well.
>>
>> Hi Simon, is this hang related to the accept_failed bug report you
>> just filed[1], or is it separate? Best,
>>
>> Adam
>>
>> [1]: https://issues.apache.org/jira/browse/COUCHDB-536
>
> Hi Adam,
>
> I would consider it separate. The accept_failed issue happens only
> when there are lots and lots of changes
>
> (essentially: while true { put a couple of docs, query views, delete
> the docs })
>
> Simon

So, until JIRA comes back online I'll follow up with that here. I can
see how repeated pull replications in rapid succession could end up
blowing through sockets. Each pull replication sets up one new
connection for the _changes feed and tears it down at the end
(everything else replication-related goes through a connection pool).
Do enough of those very short requests and you can end up with lots of
connections in TIME_WAIT and eventually run out of sockets. FWIW, the
default Erlang limit is slightly less than 1024.

If your update_notification process uses a new connection for every
POST to _replicate, you'll hit the system limit (also 1024 on Ubuntu,
IIRC) twice as fast.

Continuous replication is really our preferred solution for your
scenario. If you can live with interpreting the records in the _local
document to verify that it's still running, you'll end up with a more
efficient replication system all around.

Regarding the hangs, if you do write a test script I'll be more than
happy to try it and figure out what's going wrong. Best,

Adam
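
P.S. To make the _local-document approach concrete, here is a minimal
sketch in Python (stdlib only) of what I have in mind: start a
continuous pull replication over a single reused connection (so you
aren't opening a fresh socket per POST), then periodically read the
replication checkpoint doc and watch whether source_last_seq is still
advancing. The host, port, database names, and the way the replication
id is obtained are assumptions for illustration; the real checkpoint id
is computed by the replicator, so you would capture it when you kick
the replication off.

import json
import time
from http.client import HTTPConnection

COUCH_HOST = "localhost"   # assumption: local CouchDB node
COUCH_PORT = 5984

def start_continuous_pull(conn, source_url, target_db):
    # POST to _replicate with continuous=true on a reused connection.
    body = json.dumps({"source": source_url,
                       "target": target_db,
                       "continuous": True})
    conn.request("POST", "/_replicate", body,
                 {"Content-Type": "application/json"})
    resp = conn.getresponse()
    return json.loads(resp.read())

def checkpoint_seq(conn, db, rep_id):
    # Read the replication checkpoint from the _local doc, if present.
    conn.request("GET", "/%s/_local/%s" % (db, rep_id))
    resp = conn.getresponse()
    data = resp.read()   # drain the response so the socket can be reused
    if resp.status != 200:
        return None
    return json.loads(data).get("source_last_seq")

if __name__ == "__main__":
    # One persistent connection for all control requests, instead of a
    # fresh socket per request -- avoids piling up TIME_WAIT sockets.
    conn = HTTPConnection(COUCH_HOST, COUCH_PORT)
    start_continuous_pull(conn, "http://other-node:5984/mydb", "mydb")

    rep_id = "some-replication-id"  # assumption: captured at start time
    last = None
    while True:
        time.sleep(60)
        seq = checkpoint_seq(conn, "mydb", rep_id)
        if seq is None or seq == last:
            print("replication looks stalled (last seq: %s)" % last)
        last = seq

Obviously untested against your setup, but it shows the shape of the
check: as long as source_last_seq keeps moving between polls, the
continuous replication is still doing work, and you only restart it
when the sequence genuinely stops advancing.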