couchdb-user mailing list archives

From Robert Samuel Newson <rnew...@apache.org>
Subject Re: incomplete replication under 2.0.0
Date Fri, 24 Mar 2017 16:03:44 GMT
Sorry for the late reply.

That's very curious. Can you file a JIRA for this? If the replicator says it replicated to
the target, that should always be true. I can't immediately think why emfile would wreck that
(I'd expect the writes to either fail or succeed and for the replicator to agree).

B.


> On 21 Mar 2017, at 16:26, Christopher D. Malon <malon@groupring.net> wrote:
> 
> These problems appear to be due to the replicator crashing
> with {error,{conn_failed,{error,emfile}}}, which apparently
> means that I exceeded the open file limit.
> 
> The replications were successful if I executed
> 
> ulimit -Sn 4096
> 
> prior to launching CouchDB, in the same shell.
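> 
> (A minimal sketch of the limit check, assuming CouchDB is started
> directly from a shell; init/systemd setups need the limit raised in
> their own config instead, and 4096 is just the value that worked here:)
> 
> ulimit -Sn          # show the current soft limit on open files
> ulimit -Sn 4096     # raise it for this shell session
> couchdb             # then start CouchDB from the same shell (command depends on the install)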
> 
> I'm a bit surprised the replication can't recover after some
> files are closed; regular DB gets and puts still worked.
> 
> 
> On Wed, 15 Mar 2017 19:43:27 -0400
> "Christopher D. Malon" <malon@groupring.net> wrote:
> 
>> Those both return 
>> 
>> {"error":"not_found","reason":"missing"}
>> 
>> In the latest example, I have a database where the source has
>> doc_count 226, the target gets doc_count 222, and the task reports
>> 
>>  docs_read: 230
>>  docs_written: 230
>>  missing_revisions_found: 230
>>  revisions_checked: 231
>> 
>> but the missing documents don't show up as deleted.
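>> 
>> (One way to see exactly which ids are missing, sketched with curl and
>> jq; "source", "target", and "dbname" are placeholders, and jq is just
>> one convenient way to pull the ids out:)
>> 
>> curl -s http://source:5984/dbname/_all_docs | jq -r '.rows[].id' | sort > source_ids
>> curl -s http://target:5984/dbname/_all_docs | jq -r '.rows[].id' | sort > target_ids
>> comm -23 source_ids target_ids    # ids present on the source but not the target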
>> 
>> 
>> On Wed, 15 Mar 2017 23:13:57 +0000
>> Robert Samuel Newson <rnewson@apache.org> wrote:
>> 
>>> Hi,
>>> 
>>> The presence of:
>>> 
>>>>>> docs_read: 12
>>>>>> docs_written: 12
>>> 
>>> is what struck me here. The replicator claims to have replicated 12 docs, which
>>> is your expectation and mine, and yet you say they don't appear in the target.
>>> 
>>> Do you know the doc ids of these missing documents? If so, try GET /dbname/docid?deleted=true
>>> and GET /dbname/docid?open_revs=all
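>>> 
>>> (As curl against the target, with dbname/docid as placeholders; the
>>> Accept header just keeps the open_revs response in plain JSON:)
>>> 
>>> curl http://localhost:5984/dbname/docid?deleted=true
>>> curl -H 'Accept: application/json' 'http://localhost:5984/dbname/docid?open_revs=all'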
>>> 
>>> B.
>>> 
>>>> On 15 Mar 2017, at 18:45, Christopher D. Malon <malon@groupring.net> wrote:
>>>> 
>>>> Could you explain the meaning of source_seq, checkpointed_source_seq,
>>>> and through_seq in more detail?  This problem has happened several times,
>>>> with slightly different statuses in _active_tasks, and slightly different
>>>> numbers of documents successfully copied.  On the most recent attempt,
>>>> checkpointed_source_seq and through_seq are 61-* (matching the source's
>>>> update_seq), but source_seq is 0, and just 9 of the 12 documents are copied.
>>>> 
>>>> When a replication task is in _replicator but is not listed in _active_tasks
>>>> within two minutes, a script of mine deletes the job from _replicator
>>>> and re-submits it.  In CouchDB 1.6, this seemed to resolve some kinds
>>>> of stalled replications.  Now I wonder if the replication is not resuming
>>>> properly after the deletion and resubmission.
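>>>> 
>>>> (The delete/re-submit step, roughly; the admin credentials are a
>>>> placeholder and the body is trimmed to the fields from the
>>>> _replicator record further down:)
>>>> 
>>>> REV=$(curl -s http://admin:pass@localhost:5984/_replicator/172.16.100.222_library | jq -r ._rev)
>>>> curl -X DELETE "http://admin:pass@localhost:5984/_replicator/172.16.100.222_library?rev=$REV"
>>>> curl -X PUT http://admin:pass@localhost:5984/_replicator/172.16.100.222_library \
>>>>      -H 'Content-Type: application/json' \
>>>>      -d '{"source":"http://172.16.100.222:5984/library","target":"http://localhost:5984/library","continuous":true,"create_target":true}'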
>>>> 
>>>> Christopher
>>>> 
>>>> 
>>>> On Fri, 10 Mar 2017 06:40:49 +0000
>>>> Robert Newson <rnewson@apache.org> wrote:
>>>> 
>>>>> Were the six missing documents newer on the target? That is, did you
>>>>> delete them on the target and expect another replication to restore them?
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On 9 Mar 2017, at 22:08, Christopher D. Malon <malon@groupring.net> wrote:
>>>>>> 
>>>>>> I replicated a database (continuously), but ended up with fewer
>>>>>> documents in the target than in the source.  Even if I wait,
>>>>>> the remaining documents don't appear.
>>>>>> 
>>>>>> 1. Here's the DB entry on the source machine, showing 12 documents:
>>>>>> 
>>>>>> {"db_name":"library","update_seq":"61-g1AAAAFTeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rkQGPoiQFIJlkD1bHjE-dA0hdPFgdIz51CSB19WB1BnjU5bEASYYGIAVUOh-_mRC1CyBq9-P3D0TtAYja-1mJbATVPoCoBbqXKQsA-0Fvaw","sizes":{"file":181716,"external":11524,"active":60098},"purge_seq":0,"other":{"data_size":11524},"doc_del_count":0,"doc_count":12,"disk_size":181716,"disk_format_version":6,"data_size":60098,"compact_running":false,"instance_start_time":"0"}
>>>>>> 
>>>>>> 2. Here's the DB entry on the target machine, showing 6 documents:
>>>>>> 
>>>>>> {"db_name":"library","update_seq":"6-g1AAAAFTeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rkQGPoiQFIJlkD1bHhE-dA0hdPFgdIz51CSB19QTV5bEASYYGIAVUOh-_GyFqF0DU7idG7QGI2vvEqH0AUQvyfxYA1_dvNA","sizes":{"file":82337,"external":2282,"active":5874},"purge_seq":0,"other":{"data_size":2282},"doc_del_count":0,"doc_count":6,"disk_size":82337,"disk_format_version":6,"data_size":5874,"compact_running":false,"instance_start_time":"0"}
>>>>>> 
>>>>>> 3. Here's _active_tasks for the task, converted to YAML for readability:
>>>>>> 
>>>>>> - changes_pending: 0
>>>>>> checkpoint_interval: 30000
>>>>>> checkpointed_source_seq: 61-g1AAAAJTeJyd0EsOgjAQBuAqxsfSE-gRKK08VnIT7UwhSBAWylpvojfRm-hNsLQkbAgRNtOkk__L5M8IIcvEkmSNRYmJhDArUGRJcblmajUVBDZVVaWJJchZfSwAucPQkWRV5jKKT3kke-KwVRP2jWBpgdMAwcOuTJ8U1tKhkSZaYhS5x2GodKylWyPZWnJ9QW3KBkr5TE1yV4_CHu1dMeyQ-c4o7Wm0V9u4F9setaM_GzfK2yifWplrxYeAcuGOuulrNN3X1PTFgXPqd-XSHxdwuSQ
>>>>>> continuous: !!perl/scalar:JSON::PP::Boolean 1
>>>>>> database: shards/00000000-1fffffff/_replicator.1489086006
>>>>>> doc_id: 172.16.100.222_library
>>>>>> doc_write_failures: 0
>>>>>> docs_read: 12
>>>>>> docs_written: 12
>>>>>> missing_revisions_found: 12
>>>>>> node: couchdb@localhost
>>>>>> pid: <0.5521.0>
>>>>>> replication_id: c60427215125bd97559d069f6fb3ddb4+continuous+create_target
>>>>>> revisions_checked: 12
>>>>>> source: http://172.16.100.222:5984/library/
>>>>>> source_seq: 61-g1AAAAJTeJyd0EsOgjAQBuAqxsfSE-gRKK08VnIT7UwhSBAWylpvojfRm-hNsLQkbAgRNtOkk__L5M8IIcvEkmSNRYmJhDArUGRJcblmajUVBDZVVaWJJchZfSwAucPQkWRV5jKKT3kke-KwVRP2jWBpgdMAwcOuTJ8U1tKhkSZaYhS5x2GodKylWyPZWnJ9QW3KBkr5TE1yV4_CHu1dMeyQ-c4o7Wm0V9u4F9setaM_GzfK2yifWplrxYeAcuGOuulrNN3X1PTFgXPqd-XSHxdwuSQ
>>>>>> started_on: 1489086008
>>>>>> target: http://localhost:5984/library/
>>>>>> through_seq: 61-g1AAAAJTeJyd0EsOgjAQBuAqxsfSE-gRKK08VnIT7UwhSBAWylpvojfRm-hNsLQkbAgRNtOkk__L5M8IIcvEkmSNRYmJhDArUGRJcblmajUVBDZVVaWJJchZfSwAucPQkWRV5jKKT3kke-KwVRP2jWBpgdMAwcOuTJ8U1tKhkSZaYhS5x2GodKylWyPZWnJ9QW3KBkr5TE1yV4_CHu1dMeyQ-c4o7Wm0V9u4F9setaM_GzfK2yifWplrxYeAcuGOuulrNN3X1PTFgXPqd-XSHxdwuSQ
>>>>>> type: replication
>>>>>> updated_on: 1489096815
>>>>>> user: peer
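>>>>>> 
>>>>>> (Pulled with something like the following and then converted to
>>>>>> YAML; the jq filter on type is optional:)
>>>>>> 
>>>>>> curl -s http://localhost:5984/_active_tasks | jq '.[] | select(.type == "replication")'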
>>>>>> 
>>>>>> 4. Here's the _replicator record for the task:
>>>>>> 
>>>>>> {"_id":"172.16.100.222_library","_rev":"2-8e6cf63bc167c7c7e4bd38242218572c","schema":1,"storejson":null,"source":"http://172.16.100.222:5984/library","target":"http://localhost:5984/library","create_target":true,"dont_storejson":1,"wholejson":{},"user_ctx":{"roles":["_admin"],"name":"peer"},"continuous":true,"owner":null,"_replication_state":"triggered","_replication_state_time":"2017-03-09T19:00:08+00:00","_replication_id":"c60427215125bd97559d069f6fb3ddb4"}
>>>>>> 
>>>>>> There should have been no conflicting transactions on the target host.
>>>>>> The appearance of "61-*" in through_seq of the _active_tasks entry
>>>>>> gives me a false sense of security; I only noticed the missing documents
>>>>>> by chance.
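>>>>>> 
>>>>>> (A blunt check that does catch it, comparing doc_count on the two
>>>>>> database info documents shown above in 1. and 2.:)
>>>>>> 
>>>>>> curl -s http://172.16.100.222:5984/library | jq .doc_count    # source: 12
>>>>>> curl -s http://localhost:5984/library | jq .doc_count         # target: 6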
>>>>>> 
>>>>>> A fresh replication to a different target succeeded without any
>>>>>> missing documents.
>>>>>> 
>>>>>> Is there anything here that would tip me off that the target wasn't
>>>>>> in sync with the source?  Is there a good way to resolve the condition?
>>>>>> 
>>>>>> Thanks,
>>>>>> Christopher
>>>>> 
>>> 

