couchdb-dev mailing list archives

From "Alex Markham (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (COUCHDB-1364) Replication hanging/failing on docs with lots of revisions
Date Thu, 15 Dec 2011 14:20:30 GMT

     [ https://issues.apache.org/jira/browse/COUCHDB-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Markham updated COUCHDB-1364:
----------------------------------

    Attachment: do_checkpoint error push.txt

Hi Felipe - which couch (on which end of the replication) needs to be updated?

I looked at the Wireshark captures for the pull and the push replication, from host28 -> host25.
For the pull, the replication seems to start, fetches the changes list from seq 390505 and then POSTs an _ensure_full_commit. There doesn't seem to be a reply to this, so it just ends up hanging. My replication script cancels the ongoing replication and then restarts it every 5 minutes (roughly the cycle sketched below), and the hang seems to last much longer than that.

POST /master_db/_ensure_full_commit?seq=3914198 HTTP/1.1
User-Agent: CouchDB/1.1.1
Accept: application/json
Accept-Encoding: gzip
Content-Type: application/json
Content-Length: 0
Host: host28:5984
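
For reference, a minimal sketch of that cancel/restart cycle against the _replicate API (illustrative only, not the actual script; hostnames, database name and the requests library are assumptions/placeholders):

import requests  # sketch assumes the Python requests library

TARGET = "http://host25:5984"  # placeholder: the couch doing the pull
REPL = {
    "source": "http://host28:5984/master_db",  # placeholder source URL
    "target": "master_db",
    "continuous": True,
}

def restart_replication():
    # Cancel the ongoing replication first (CouchDB reports an error if no matching
    # replication is running, which can be ignored here).
    requests.post(TARGET + "/_replicate", json=dict(REPL, cancel=True), timeout=30)
    # Then start it again; it resumes from the last recorded checkpoint.
    resp = requests.post(TARGET + "/_replicate", json=REPL, timeout=30)
    resp.raise_for_status()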

I also have a different stack trace for what I think is the same problem, in "do_checkpoint error.txt": the last Wireshark activity appeared to be a /_ensure_full_commit at 12:12:04, and at 12:12:34 the timeout error appeared and the replication failed.

POST /master_db/_ensure_full_commit HTTP/1.1
User-Agent: CouchDB/1.1.1
Accept: application/json
Accept-Encoding: gzip
Content-Type: application/json
Content-Length: 0
Host: host25:5984
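
Issuing that _ensure_full_commit by hand shows whether the target answers at all within the roughly 30 second window the timestamps above suggest. A minimal sketch (host and database names are placeholders, and the requests library is assumed):

import requests  # sketch assumes the Python requests library

try:
    resp = requests.post(
        "http://host25:5984/master_db/_ensure_full_commit",
        headers={"Content-Type": "application/json"},
        timeout=30,  # roughly the gap between the request at 12:12:04 and the error at 12:12:34
    )
    # A healthy node replies 201 with {"ok": true, "instance_start_time": ...}
    print(resp.status_code, resp.json())
except requests.exceptions.Timeout:
    print("no reply within 30s - same symptom as the hanging checkpoint")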

                
> Replication hanging/failing on docs with lots of revisions
> ----------------------------------------------------------
>
>                 Key: COUCHDB-1364
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1364
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.0.3, 1.1.1
>         Environment: CentOS 5.6/x64, SpiderMonkey 1.8.5, CouchDB 1.1.1 patched for COUCHDB-1340 and COUCHDB-1333
>            Reporter: Alex Markham
>              Labels: open_revs, replication
>         Attachments: COUCHDB-1364-11x.patch, do_checkpoint error push.txt, replication error changes_loop died redacted.txt
>
>
> We have a setup where replication from a 1.1.1 couch is hanging - this is WAN replication which previously worked 1.0.3 <-> 1.0.3.
> Replicating from the 1.1.1 -> 1.0.3 showed an error very similar to COUCHDB-1340 - which I presumed meant the URL was too long. So I upgraded the 1.0.3 couch to our 1.1.1 build, which had this patched.
> However, the replication between the two 1.1.1 couches hangs at a certain point when doing continuous pull replication - it doesn't checkpoint and just stays on "starting". When cancelled and restarted, though, it does pick up the latest documents (so doc counts are equal). The last calls I see to the source db when it hangs are multiple long GETs for a document with 2051 open revisions on the source and 498 on the target.
> When doing a push replication, the _replicate call just gives a 500 error (at about the same seq id as the pull replication hangs at), saying:
> [Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died with reason
> {noproc,{gen_server,call,[<0.6382.115>,{pread_iolist,79043596434},infinity]}}
> when the last call in the target of the push replication is:
> [Thu, 15 Dec 2011 10:09:17 GMT] [info] [<0.580.50>] 10.35.9.79 - - 'POST' /master_db/_missing_revs 200
> with no stack trace.
> Comparing the open_revs=all count on the documents with many open revs shows differing numbers on each side of the replication WAN, and between different couches in the same datacentre. Some of these documents have not been updated for months. Is it possible that 1.0.3 just skipped over this issue and carried on replicating, but 1.1.1 does not?
> I know I can hack the replication to work by updating the checkpoint seq past this point in the _local document, but I think there is a real bug here somewhere.
> If Wireshark/debug data is required, please say so.
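
The open_revs=all comparison described in the quoted report can be scripted along these lines (illustrative sketch only; host names, database name, document id and the requests library are placeholders/assumptions):

import requests  # sketch assumes the Python requests library

def count_open_revs(base_url, db, doc_id):
    # With Accept: application/json, CouchDB returns a JSON array holding one
    # entry per leaf revision (either {"ok": {...}} or {"missing": "<rev>"}).
    resp = requests.get(
        "%s/%s/%s" % (base_url, db, doc_id),
        params={"open_revs": "all"},
        headers={"Accept": "application/json"},
        timeout=60,
    )
    resp.raise_for_status()
    return len(resp.json())

for host in ("http://host25:5984", "http://host28:5984"):  # placeholder hosts
    print(host, count_open_revs(host, "master_db", "some_doc_id"))  # placeholder doc id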

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
