couchdb-dev mailing list archives

From "Enda Farrell (JIRA)" <j...@apache.org>
Subject [jira] Commented: (COUCHDB-416) Replicating shards into a single aggregation node may cause endless respawning
Date Fri, 17 Jul 2009 11:57:14 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732487#action_12732487 ]

Enda Farrell commented on COUCHDB-416:
--------------------------------------

Hi Adam.

Due to the way I have been playing with the environments I am afraid I don't have debug logs
for this particular test.

However ...

As I was attempting to recreate this, I did find a possible bug which has the same symptoms.
Essentially, as I think you're hinting at above, if the source database doesn't exist, the
replication module does not handle the 404 responses and keeps retrying endlessly. I have an
example here of trying to pull replicate from a source that's known NOT to exist:

***********************************************************************************************************************************
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.141.0>] 'GET' /social/ {1,1}
** Headers: [{'Host',"10.10.10.16:5984"}]
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.141.0>] httpd 404 error response:
** {"error":"not_found","reason":"Missing"}
** 
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [info] [<0.141.0>] 10.10.10.15 - - 'GET' /social/ 404
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2451.0>] 'GET' /social/_local%2F5ba3add9594c40bb0b0480ff454d89a2 {1,1}
** Headers: [{'Host',"10.10.10.16:5984"}]
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2451.0>] httpd 404 error response:
** {"error":"not_found","reason":"Missing"}
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [info] [<0.2451.0>] 10.10.10.15 - - 'GET' /social/_local%2F5ba3add9594c40bb0b0480ff454d89a2 404
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2454.0>] 'GET' /social/_all_docs_by_seq?limit=100&startkey=0 {1,1}
** Headers: [{'Host',"10.10.10.16:5984"}]
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2454.0>] httpd 404 error response:
** {"error":"not_found","reason":"Missing"}
** 
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [info] [<0.2454.0>] 10.10.10.15 - - 'GET' /social/_all_docs_by_seq?limit=100&startkey=0 404
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2457.0>] 'GET' /social/_all_docs_by_seq?limit=100&startkey=0 {1,1}
** Headers: [{'Host',"10.10.10.16:5984"}]
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2457.0>] httpd 404 error response:
** {"error":"not_found","reason":"Missing"}
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [info] [<0.2457.0>] 10.10.10.15 - - 'GET' /social/_all_docs_by_seq?limit=100&startkey=0 404
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2460.0>] 'GET' /social/_all_docs_by_seq?limit=100&startkey=0 {1,1}
** Headers: [{'Host',"10.10.10.16:5984"}]
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2460.0>] httpd 404 error response:
** {"error":"not_found","reason":"Missing"}
** 
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [info] [<0.2460.0>] 10.10.10.15 - - 'GET' /social/_all_docs_by_seq?limit=100&startkey=0 404
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2463.0>] 'GET' /social/_all_docs_by_seq?limit=100&startkey=0 {1,1}
** Headers: [{'Host',"10.10.10.16:5984"}]
** 
** [Thu, 16 Jul 2009 16:23:05 GMT] [debug] [<0.2463.0>] httpd 404 error response:
** {"error":"not_found","reason":"Missing"}
** 
**
***********************************************************************************************************************************

Is this a known bug? If not, I can open a new issue for it.


It *is* possible that the above is the bug, and that it has nothing to do with replicating
the same database-name from many sources into a single (aggregator) database. I have not yet
been able to rule it in or out. Nevertheless, I'll be changing our "replication controller"
code to be mindful of this issue.
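
As a rough illustration of the kind of guard a replication controller could apply, here is
a minimal Python sketch. It is not our actual controller code: the function names
`source_exists` and `pull_replicate`, the injectable `open_url` parameter and the
host names in the usage note are all hypothetical. It assumes the target node accepts a
POST to `/_replicate` with a JSON body naming `source` and `target`, and it simply
skips the replication when a GET on the source database returns 404.

```python
import json
import urllib.error
import urllib.request


def source_exists(db_url, open_url=urllib.request.urlopen):
    """Return True if GET db_url succeeds, False if the server answers 404."""
    try:
        with open_url(db_url) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # any other HTTP error is left for the caller to handle


def pull_replicate(target_server, source_db_url, target_db,
                   open_url=urllib.request.urlopen):
    """Ask target_server to pull from source_db_url, but only if the source exists.

    Returns the decoded JSON response, or None when the replication was skipped.
    """
    if not source_exists(source_db_url, open_url):
        return None  # skip: never ask the target to pull from a missing source
    body = json.dumps({"source": source_db_url, "target": target_db}).encode("utf-8")
    req = urllib.request.Request(
        target_server.rstrip("/") + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with open_url(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With placeholder hosts, a call such as
`pull_replicate("http://10.10.10.15:5984", "http://10.10.10.16:5984/social/", "social")`
would return None rather than start a replication if `/social/` is missing on the source.
This only works around the symptom on the controller side; it does not change how the
replication module itself reacts to a 404.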



> Replicating shards into a single aggregation node may cause endless respawning
> ------------------------------------------------------------------------------
>
>                 Key: COUCHDB-416
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-416
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.9
>         Environment: couchdb 0.9.0.r766883 CentOS x86_64
>            Reporter: Enda Farrell
>            Assignee: Adam Kocoloski
>            Priority: Critical
>         Attachments: Picture 2.png
>
>
> I have a set of CouchDB instances, each one acting as a shard for a large set of data.
> Occasionally, we replicate each instance's database into a different CouchDB instance. We always "pull" replicate (see image attached).
> When we do this, we often see errors like this on the target instance:
> * [Thu, 16 Jul 2009 13:52:32 GMT] [error] [emulator] Error in process <0.29787.102> with exit value: {function_clause,[{lists,map,[#Fun<couch_rep.6.75683565>,undefined]},{couch_rep,enum_docs_since,4}]}
> * 
> * 
> * 
> * [Thu, 16 Jul 2009 13:52:32 GMT] [error] [<0.7456.6>] replication enumerator exited with {function_clause,
> *                                     [{lists,map,
> *                                       [#Fun<couch_rep.6.75683565>,undefined]},
> *                                      {couch_rep,enum_docs_since,4}]} .. respawning
> Once this starts, it is fatal to the CouchDB instance. It logs these messages at over 1000 per second (log level = severe) and chews up HDD.
> No errors (other than an HTTP timeout) are seen.
> After a database had gone "respawning", the target node was shut down, the logs were cleared, and the target node was restarted. The log was tailed - all was quiet. Once a single replication was called again against this database, it immediately went into respawning hell again. There were no stacked replications in this case.
> From this it seems that - if a database ever goes into "respawning" it cannot recover (when your environment/setup requires replication to occur always).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

