Date: Fri, 16 Dec 2011 09:38:30 +0000 (UTC)
From: "Alex Markham (Updated) (JIRA)"
To: dev@couchdb.apache.org
Reply-To: dev@couchdb.apache.org
Message-ID: <384985034.18920.1324028310738.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1187459894.15059.1323946470603.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (COUCHDB-1364) Replication hanging/failing on docs with lots of revisions

[
https://issues.apache.org/jira/browse/COUCHDB-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Markham updated COUCHDB-1364:
----------------------------------

    Attachment: couchlog target host32.txt
                checkpoint hang seq changes.txt
                couchlog source host17.log

OK, perhaps that stack trace was a separate issue - with it patched, push replication just times out with the same error as in "do_checkpoint error push.txt".

I have done some more analysis and found 2 seq ids where replication hangs from host17 -> host32. After this point there are no more checkpoints. I managed to get continuous replication working again past them by manually updating the replication _local document on both sides to a source_last_seq id after this point.

I attach 3 logs:
1) checkpoint hang seq changes.txt - the changes feed from the source at the 2 points where replication hangs; both are docs with a very high number of open revisions.
2) couchlog source host17.log - the log of the source couch, where you can see the GET requests for those documents being made, and then replication starting again.
3) couchlog target host32.txt - the log of the couch doing the pull replication (the target); there were no errors here.

I doubt the network is the issue here: all other databases replicate fine, and this replication stream works once manually tweaked to go past those choke points. Is it possible that the database is trying to commit all these hundreds of documents at once and it takes a very long time?
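For reference, the manual workaround described above amounts to rewriting the replication's `_local` checkpoint document on both sides so that `source_last_seq` points past the stuck sequence. A minimal sketch of that edit as a pure function, operating on the JSON body of `GET /db/_local/<replication_id>` (the replication id shown is hypothetical, and the exact history fields the replicator checks may vary by version, so treat this as an illustration rather than a supported procedure):

```python
import json

def bump_checkpoint(checkpoint_doc, new_seq):
    """Return a copy of a replication _local checkpoint document with
    source_last_seq advanced past a stuck sequence id.

    The result must then be PUT back to the _local doc on *both* the
    source and the target, so the checkpoints still agree the next time
    the replication starts.
    """
    doc = dict(checkpoint_doc)
    doc["source_last_seq"] = new_seq
    # Also advance the newest history entry, since the replicator
    # compares recorded history on both sides when resuming.
    history = [dict(h) for h in doc.get("history", [])]
    if history:
        history[0]["recorded_seq"] = new_seq
    doc["history"] = history
    return doc

# Example: push the checkpoint past the stuck seq 2745054.
stuck = {
    "_id": "_local/0a81b645497e6f75634141e9fe8eb92c",  # hypothetical rep id
    "session_id": "f91c5258",
    "source_last_seq": 2745053,
    "history": [{"session_id": "f91c5258", "recorded_seq": 2745053}],
}
fixed = bump_checkpoint(stuck, 2745055)
print(json.dumps(fixed["source_last_seq"]))  # 2745055
```

The original document's `_rev` must be preserved on the PUT, as with any CouchDB update.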
I have set up a parallel replication to the same host using a different hostname (so that the replication id is different), without the "cancel:true" that I send every minute, and it has hung on that checkpoint (2745054) for the last few hours.

> Replication hanging/failing on docs with lots of revisions
> ----------------------------------------------------------
>
>                 Key: COUCHDB-1364
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1364
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.0.3, 1.1.1
>         Environment: Centos 5.6/x64, spidermonkey 1.8.5, couchdb 1.1.1 patched for COUCHDB-1340 and COUCHDB-1333
>            Reporter: Alex Markham
>              Labels: open_revs, replication
>         Attachments: COUCHDB-1364-11x.patch, checkpoint hang seq changes.txt, couchlog source host17.log, couchlog target host32.txt, do_checkpoint error push.txt, replication error changes_loop died redacted.txt
>
>
> We have a setup where replication from a 1.1.1 couch is hanging - this is WAN replication which previously worked 1.0.3 <-> 1.0.3.
> Replicating from 1.1.1 -> 1.0.3 showed an error very similar to COUCHDB-1340, which I presumed meant the url was too long, so I upgraded the 1.0.3 couch to our 1.1.1 build which had this patched.
> However, the replication between the two 1.1.1 couches hangs at a certain point when doing continuous pull replication - it doesn't checkpoint, just stays on "starting". However, when cancelled and restarted it gets the latest documents (so doc counts are equal). The last calls I see to the source db when it hangs are multiple long GETs for a document with 2051 open revisions on the source and 498 on the target.
> When doing a push replication the _replicate call just gives a 500 error (at about the same seq id as the pull replication hangs at) saying:
> [Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died with reason {noproc,
>     {gen_server,call,
>         [<0.6382.115>,
>          {pread_iolist, 79043596434},
>          infinity]}}
> when the last call on the target of the push replication is:
> [Thu, 15 Dec 2011 10:09:17 GMT] [info] [<0.580.50>] 10.35.9.79 - - 'POST' /master_db/_missing_revs 200
> with no stack trace.
> Comparing the open_revs=all count on the documents with many open revs shows differing numbers on each side of the WAN replication, and between different couches in the same datacentre. Some of these documents have not been updated for months. Is it possible that 1.0.3 just skipped over this issue and carried on replicating, but 1.1.1 does not?
> I know I can hack the replication to work by updating the checkpoint seq past this point in the _local document, but I think there is a real bug here somewhere.
> If wireshark/debug data is required, please say.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
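[Editor's note] The open_revs=all comparison mentioned in the quoted report can be reproduced by fetching each suspect document with `GET /db/<docid>?open_revs=all` on both sides and counting the entries returned. A small sketch of the counting step, working on an already-parsed JSON response so no live server is needed (the response shape assumed here - one `{"ok": ...}` or `{"missing": ...}` element per leaf revision - is the standard multipart/JSON form of that endpoint):

```python
import json

def count_open_revs(open_revs_body):
    """Count leaf revisions in the JSON array returned by
    GET /db/<docid>?open_revs=all (with Accept: application/json).

    Each element is either {"ok": {...doc...}} for an available leaf
    or {"missing": "<rev>"} for a leaf whose body could not be read.
    """
    return sum(1 for entry in open_revs_body
               if "ok" in entry or "missing" in entry)

# Toy response with three leaves, one of them missing:
body = json.loads("""
[
  {"ok": {"_id": "d1", "_rev": "3-aaa"}},
  {"ok": {"_id": "d1", "_rev": "3-bbb"}},
  {"missing": "2-ccc"}
]
""")
print(count_open_revs(body))  # 3
```

Running this against the same document on source and target would show the kind of 2051-vs-498 divergence described above.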