[ https://issues.apache.org/jira/browse/COUCHDB-416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732487#action_12732487
]
Enda Farrell commented on COUCHDB-416:
--------------------------------------
Hi Adam.
Due to the way I have been playing with the environments I am afraid I don't have debug logs
for this particular test.
However ...
As I was attempting to recreate this, I did find a possible bug which has the same symptoms.
Essentially, as I think you're hinting at above, if the source database doesn't exist, the
replication module isn't handling the 404s and endlessly keeps trying. I have an example here
of trying to pull replicate from a source that's known to NOT exist:
***********************************************************************************************************************************
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.141.0>] 'GET' /social/ {1,1}
** Headers: [{'Host',"10.10.10.16:5984"}]
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.141.0>] httpd 404 error response:
** {"error":"not_found","reason":"Missing"}
**
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [info] [<0.141.0>] 10.10.10.15 - - 'GET' /social/
404
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2451.0>] 'GET' /social/_local%2F5ba3add9594c40bb0b0480ff454d89a2
{1,1}
** Headers: [{'Host',"10.10.10.16:5984"}]
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2451.0>] httpd 404 error response:
** {"error":"not_found","reason":"Missing"}
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [info] [<0.2451.0>] 10.10.10.15 - - 'GET' /social/_local%2F5ba3add9594c40bb0b0480ff454d89a2
404
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2454.0>] 'GET' /social/_all_docs_by_seq?limit=100&startkey=0
{1,1}
** Headers: [{'Host',"10.10.10.16:5984"}]
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2454.0>] httpd 404 error response:
** {"error":"not_found","reason":"Missing"}
**
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [info] [<0.2454.0>] 10.10.10.15 - - 'GET' /social/_all_docs_by_seq?limit=100&startkey=0
404
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2457.0>] 'GET' /social/_all_docs_by_seq?limit=100&startkey=0
{1,1}
** Headers: [{'Host',"10.10.10.16:5984"}]
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2457.0>] httpd 404 error response:
** {"error":"not_found","reason":"Missing"}
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [info] [<0.2457.0>] 10.10.10.15 - - 'GET' /social/_all_docs_by_seq?limit=100&startkey=0
404
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2460.0>] 'GET' /social/_all_docs_by_seq?limit=100&startkey=0
{1,1}
** Headers: [{'Host',"10.10.10.16:5984"}]
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2460.0>] httpd 404 error response:
** {"error":"not_found","reason":"Missing"}
**
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [info] [<0.2460.0>] 10.10.10.15 - - 'GET' /social/_all_docs_by_seq?limit=100&startkey=0
404
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2463.0>] 'GET' /social/_all_docs_by_seq?limit=100&startkey=0
{1,1}
** Headers: [{'Host',"10.10.10.16:5984"}]
**
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2463.0>] httpd 404 error response:
** {"error":"not_found","reason":"Missing"}
**
**
***********************************************************************************************************************************
Is this a known bug? If not, I can open a new issue for it.
It *is* possible that the above is the bug, and that it has nothing to do with replicating
the same database-name from many sources into a single (aggregator) database. I have not yet
been able to rule it in or out. Nevertheless, I'll be changing our "replication controller"
code to be mindful of this issue.
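
As a rough illustration of what "being mindful" could look like on the controller side, here is a minimal Python sketch that checks the source database with a plain GET before asking the target to pull from it. It assumes the controller talks to CouchDB over HTTP and triggers replication with a POST to `/_replicate` carrying a JSON `source`/`target` body. The function names (`source_db_exists`, `pull_replicate`) and the injectable `opener` parameter are hypothetical and not part of CouchDB or of our actual controller code.

```python
import json
import urllib.error
import urllib.request


def source_db_exists(db_url, opener=urllib.request.urlopen, timeout=10):
    """Return True if GET db_url answers 200, False if it answers 404.

    Any other HTTP error is re-raised so the caller can decide what to do.
    """
    try:
        with opener(db_url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def pull_replicate(target_server, source_db_url, target_db,
                   opener=urllib.request.urlopen, timeout=10):
    """Trigger a pull replication on target_server, unless the source is missing.

    Returns the decoded JSON reply from /_replicate, or None when the
    source database does not exist and the replication was skipped.
    """
    if not source_db_exists(source_db_url, opener=opener, timeout=timeout):
        return None  # skip: replicating from a missing source is what loops

    body = json.dumps({"source": source_db_url, "target": target_db})
    request = urllib.request.Request(
        target_server.rstrip("/") + "/_replicate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the addresses from the log above, a call such as `pull_replicate("http://10.10.10.15:5984", "http://10.10.10.16:5984/social/", "social")` would return `None` without ever contacting `/_replicate`, because the GET on `/social/` answers 404. The port on 10.10.10.15 is assumed here. This only guards the controller; it does not change how the replicator itself reacts to a 404.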
> Replicating shards into a single aggregation node may cause endless respawning
> ------------------------------------------------------------------------------
>
> Key: COUCHDB-416
> URL: https://issues.apache.org/jira/browse/COUCHDB-416
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.9
> Environment: couchdb 0.9.0.r766883 CentOS x86_64
> Reporter: Enda Farrell
> Assignee: Adam Kocoloski
> Priority: Critical
> Attachments: Picture 2.png
>
>
> I have a set of CouchDB instances, each one acting as a shard for a large set of data.
> Occasionally, we replicate each instance's database into a different CouchDB instance.
We always "pull" replicate (see image attached)
> When we do this, we often see errors like this on the target instance:
> * [Thu, 16 Jul 2009 13:52:32 GMT] [error] [emulator] Error in process <0.29787.102>
with exit value: {function_clause,[{lists,map,[#Fun<couch_rep.6.75683565>,undefined]},{couch_rep,enum_docs_since,4}]}
> *
> *
> *
> * [Thu, 16 Jul 2009 13:52:32 GMT] [error] [<0.7456.6>] replication enumerator exited
with {function_clause,
> * [{lists,map,
> * [#Fun<couch_rep.6.75683565>,undefined]},
> * {couch_rep,enum_docs_since,4}]} .. respawning
> Once this starts, it is fatal to the CouchDB instance. It logs these messages at over
1000 per second (log level = severe) and chews up HDD.
> No errors (other than an HTTP timeout) are seen.
> After a database had gone "respawning", the target node was shut down, the logs cleared,
and the target node restarted. The log was tailed - all was quiet. Once a single replication was called
again against this database, it immediately went back into respawning hell. There were no stacked
replications in this case.
> From this it seems that - if a database ever goes into "respawning" it cannot recover
(when your environment/setup requires replication to occur always).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.