Subject: Re: incomplete replication under 2.0.0
From: Robert Samuel Newson
Date: Fri, 24 Mar 2017 16:03:44 +0000
To: "Christopher D. Malon"
Cc: user@couchdb.apache.org

Sorry for the late reply.

That's very curious. Can you file a JIRA for this? If the replicator says it replicated to the target, that should always be true. I can't immediately think why emfile would wreck that (I'd expect the writes either to fail or to succeed, and the replicator to agree either way).

B.

> On 21 Mar 2017, at 16:26, Christopher D. Malon wrote:
>
> These problems appear to be due to the replicator crashing
> with {error,{conn_failed,{error,emfile}}}, which apparently
> means that I surpassed an open file limit.
>
> The replications were successful if I executed
>
> ulimit -Sn 4096
>
> prior to launching CouchDB, in the same shell.
>
> I'm a bit surprised the replication can't recover after some
> files are closed; regular DB gets and puts still worked.
>
>
> On Wed, 15 Mar 2017 19:43:27 -0400
> "Christopher D. Malon" wrote:
>
>> Those both return
>>
>> {"error":"not_found","reason":"missing"}
>>
>> In the latest example, I have a database where the source has
>> doc_count 226, the target gets doc_count 222, and the task reports
>>
>> docs_read: 230
>> docs_written: 230
>> missing_revisions_found: 230
>> revisions_checked: 231
>>
>> but the missing documents don't show up as deleted.
>>
>>
>> On Wed, 15 Mar 2017 23:13:57 +0000
>> Robert Samuel Newson wrote:
>>
>>> Hi,
>>>
>>> The presence of
>>>
>>>>>> docs_read: 12
>>>>>> docs_written: 12
>>>
>>> is what struck me here. The replicator claims to have replicated 12 docs, which is your expectation and mine, and yet you say they don't appear in the target.
>>>
>>> Do you know the doc ids of these missing documents? If so, try GET /dbname/docid?deleted=true and GET /dbname/docid?open_revs=all
>>>
>>> B.
>>>
>>>> On 15 Mar 2017, at 18:45, Christopher D. Malon wrote:
>>>>
>>>> Could you explain the meaning of source_seq, checkpointed_source_seq,
>>>> and through_seq in more detail? This problem has happened several times,
>>>> with slightly different statuses in _active_tasks, and slightly different
>>>> numbers of documents successfully copied. On the most recent attempt,
>>>> checkpointed_source_seq and through_seq are 61-* (matching the source's
>>>> update_seq), but source_seq is 0, and just 9 of the 12 documents are copied.
>>>>
>>>> When a replication task is in _replicator but is not listed in _active_tasks
>>>> within two minutes, a script of mine deletes the job from _replicator
>>>> and re-submits it. In CouchDB 1.6, this seemed to resolve some kinds
>>>> of stalled replications. Now I wonder if the replication is not resuming
>>>> properly after the deletion and resubmission.
>>>>
>>>> Christopher
>>>>
>>>>
>>>> On Fri, 10 Mar 2017 06:40:49 +0000
>>>> Robert Newson wrote:
>>>>
>>>>> Were the six missing documents newer on the target? That is, did you
>>>>> delete them on the target and expect another replication to restore them?
>>>>>
>>>>> Sent from my iPhone
>>>>>
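
For reference, the two checks suggested above can be run with curl. A rough sketch, using the target URL from this thread and a placeholder DOC_ID rather than a real doc id:

    curl 'http://localhost:5984/library/DOC_ID?deleted=true'
    curl 'http://localhost:5984/library/DOC_ID?open_revs=all'

For the missing documents in this thread, both checks came back with {"error":"not_found","reason":"missing"}.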
>>>>>> On 9 Mar 2017, at 22:08, Christopher D. Malon wrote:
>>>>>>
>>>>>> I replicated a database (continuously), but ended up with fewer
>>>>>> documents in the target than in the source. Even if I wait,
>>>>>> the remaining documents don't appear.
>>>>>>
>>>>>> 1. Here's the DB entry on the source machine, showing 12 documents:
>>>>>>
>>>>>> {"db_name":"library","update_seq":"61-g1AAAAFTeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rkQGPoiQFIJlkD1bHjE-dA0hdPFgdIz51CSB19WB1BnjU5bEASYYGIAVUOh-_mRC1CyBq9-P3D0TtAYja-1mJbATVPoCoBbqXKQsA-0Fvaw","sizes":{"file":181716,"external":11524,"active":60098},"purge_seq":0,"other":{"data_size":11524},"doc_del_count":0,"doc_count":12,"disk_size":181716,"disk_format_version":6,"data_size":60098,"compact_running":false,"instance_start_time":"0"}
>>>>>>
>>>>>> 2. Here's the DB entry on the target machine, showing 6 documents:
>>>>>>
>>>>>> {"db_name":"library","update_seq":"6-g1AAAAFTeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rkQGPoiQFIJlkD1bHhE-dA0hdPFgdIz51CSB19QTV5bEASYYGIAVUOh-_GyFqF0DU7idG7QGI2vvEqH0AUQvyfxYA1_dvNA","sizes":{"file":82337,"external":2282,"active":5874},"purge_seq":0,"other":{"data_size":2282},"doc_del_count":0,"doc_count":6,"disk_size":82337,"disk_format_version":6,"data_size":5874,"compact_running":false,"instance_start_time":"0"}
>>>>>>
>>>>>> 3. Here's _active_tasks for the task, converted to YAML for readability:
>>>>>>
>>>>>> - changes_pending: 0
>>>>>>   checkpoint_interval: 30000
>>>>>>   checkpointed_source_seq: 61-g1AAAAJTeJyd0EsOgjAQBuAqxsfSE-gRKK08VnIT7UwhSBAWylpvojfRm-hNsLQkbAgRNtOkk__L5M8IIcvEkmSNRYmJhDArUGRJcblmajUVBDZVVaWJJchZfSwAucPQkWRV5jKKT3kke-KwVRP2jWBpgdMAwcOuTJ8U1tKhkSZaYhS5x2GodKylWyPZWnJ9QW3KBkr5TE1yV4_CHu1dMeyQ-c4o7Wm0V9u4F9setaM_GzfK2yifWplrxYeAcuGOuulrNN3X1PTFgXPqd-XSHxdwuSQ
>>>>>>   continuous: !!perl/scalar:JSON::PP::Boolean 1
>>>>>>   database: shards/00000000-1fffffff/_replicator.1489086006
>>>>>>   doc_id: 172.16.100.222_library
>>>>>>   doc_write_failures: 0
>>>>>>   docs_read: 12
>>>>>>   docs_written: 12
>>>>>>   missing_revisions_found: 12
>>>>>>   node: couchdb@localhost
>>>>>>   pid: <0.5521.0>
>>>>>>   replication_id: c60427215125bd97559d069f6fb3ddb4+continuous+create_target
>>>>>>   revisions_checked: 12
>>>>>>   source: http://172.16.100.222:5984/library/
>>>>>>   source_seq: 61-g1AAAAJTeJyd0EsOgjAQBuAqxsfSE-gRKK08VnIT7UwhSBAWylpvojfRm-hNsLQkbAgRNtOkk__L5M8IIcvEkmSNRYmJhDArUGRJcblmajUVBDZVVaWJJchZfSwAucPQkWRV5jKKT3kke-KwVRP2jWBpgdMAwcOuTJ8U1tKhkSZaYhS5x2GodKylWyPZWnJ9QW3KBkr5TE1yV4_CHu1dMeyQ-c4o7Wm0V9u4F9setaM_GzfK2yifWplrxYeAcuGOuulrNN3X1PTFgXPqd-XSHxdwuSQ
>>>>>>   started_on: 1489086008
>>>>>>   target: http://localhost:5984/library/
>>>>>>   through_seq: 61-g1AAAAJTeJyd0EsOgjAQBuAqxsfSE-gRKK08VnIT7UwhSBAWylpvojfRm-hNsLQkbAgRNtOkk__L5M8IIcvEkmSNRYmJhDArUGRJcblmajUVBDZVVaWJJchZfSwAucPQkWRV5jKKT3kke-KwVRP2jWBpgdMAwcOuTJ8U1tKhkSZaYhS5x2GodKylWyPZWnJ9QW3KBkr5TE1yV4_CHu1dMeyQ-c4o7Wm0V9u4F9setaM_GzfK2yifWplrxYeAcuGOuulrNN3X1PTFgXPqd-XSHxdwuSQ
>>>>>>   type: replication
>>>>>>   updated_on: 1489096815
>>>>>>   user: peer
>>>>>>
>>>>>> 4. Here's the _replicator record for the task:
>>>>>>
>>>>>> {"_id":"172.16.100.222_library","_rev":"2-8e6cf63bc167c7c7e4bd38242218572c","schema":1,"storejson":null,"source":"http://172.16.100.222:5984/library","target":"http://localhost:5984/library","create_target":true,"dont_storejson":1,"wholejson":{},"user_ctx":{"roles":["_admin"],"name":"peer"},"continuous":true,"owner":null,"_replication_state":"triggered","_replication_state_time":"2017-03-09T19:00:08+00:00","_replication_id":"c60427215125bd97559d069f6fb3ddb4"}
>>>>>>
>>>>>> There should have been no conflicting transactions on the target host.
>>>>>> The appearance of "61-*" in through_seq of the _active_tasks entry
>>>>>> gives me a false sense of security; I only noticed the missing documents
>>>>>> by chance.
>>>>>>
>>>>>> A fresh replication to a different target succeeded without any
>>>>>> missing documents.
>>>>>>
>>>>>> Is there anything here that would tip me off that the target wasn't
>>>>>> in sync with the source? Is there a good way to resolve the condition?
>>>>>>
>>>>>> Thanks,
>>>>>> Christopher
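
As an aside, one sanity check that does not depend on the replicator's own counters is to compare the database metadata on both sides once the task reports changes_pending: 0. A sketch using the source and target URLs from the _active_tasks entry above (assuming curl is available and no authentication is needed on the URLs):

    curl -s 'http://172.16.100.222:5984/library' | grep -o '"doc_count":[0-9]*'
    curl -s 'http://localhost:5984/library' | grep -o '"doc_count":[0-9]*'

With doc_del_count at 0 on both sides, a difference in doc_count (12 versus 6 here) means the target is out of sync no matter what docs_written reports.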