couchdb-dev mailing list archives

From "Randall Leeds (JIRA)" <j...@apache.org>
Subject [jira] Updated: (COUCHDB-597) Replication tasks crash.
Date Thu, 25 Feb 2010 10:07:28 GMT

     [ https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds updated COUCHDB-597:
----------------------------------

    Attachment: couchdb_597.patch

I believe this patch fixes most of the problems we're seeing here.

The solution, as discussed, is to remove the inactivity_timeout from options passed to ibrowse
and handle timeouts manually (here using the timer module).

In my testing, most of the timeouts I could reproduce were caused by not reading data from
ibrowse fast enough. In other words, replication from a remote database was terminating because
processing the changes took a long time and the socket sat inactive while couch_rep_changes_feed
had a full queue of rows. Therefore, with this patch a timeout is not set unless the missing
revs server is waiting for more changes.
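
As an illustration only (this is not code from the patch, which is Erlang), the policy above
can be sketched as a small Python model. The class name, method names, and the 30 second value
are all hypothetical; the point is just that the idle timer is armed only while the consumer
is waiting on an empty queue, and is cancelled as soon as a row arrives.

```python
class ChangesFeed:
    """Toy model: the idle timer runs only while a reader waits on an empty queue."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout  # seconds; hypothetical setting
        self.queue = []                   # rows received but not yet processed
        self.deadline = None              # None means no timer is armed

    def on_row(self, row):
        # Data arrived from the socket: buffer it and cancel any pending timer.
        self.queue.append(row)
        self.deadline = None

    def next_row(self, now):
        # The consumer asks for more work.
        if self.queue:
            return self.queue.pop(0)
        # Queue is empty, so the consumer is now waiting: arm the timer once.
        if self.deadline is None:
            self.deadline = now + self.idle_timeout
        return None

    def timed_out(self, now):
        # A timeout can only fire if a timer was armed by an empty-queue wait.
        return self.deadline is not None and now >= self.deadline


feed = ChangesFeed(idle_timeout=30.0)
feed.on_row({"seq": 1})
slow_consumer_ok = not feed.timed_out(now=100.0)  # queue is full, no timer armed
feed.next_row(now=100.0)                          # drains the only row
feed.next_row(now=100.0)                          # empty queue: timer armed here
idle_detected = feed.timed_out(now=130.0)         # idle socket and empty queue
```

In this model a slow consumer can never trigger a timeout by itself, while a socket that
stays silent after the queue has drained still does.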

Timeouts should still occur if the socket is idle and the local queue of received changes
is empty. Errors should be caught appropriately so that real problems still bubble up.

I implemented retry logic for attachments in a manner similar to couch_rep_httpc. I also had
to add some after clauses now that the inactivity_timeout is not set.
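
For readers unfamiliar with this kind of retry loop, here is a rough Python sketch of the
general shape. It is not taken from the patch or from couch_rep_httpc: the function name, the
retry count, the starting pause, and the doubling of the pause between attempts are all
assumptions made for illustration.

```python
import time


def fetch_with_retry(fetch, retries=3, pause=0.5, sleep=time.sleep):
    """Call fetch() and retry on timeout, waiting longer before each new attempt.

    fetch   -- hypothetical zero-argument callable that returns the attachment body
    retries -- how many times to retry after the first failure (assumed value)
    pause   -- initial wait in seconds, doubled after every failure (assumed policy)
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except TimeoutError:
            if attempt >= retries:
                # Out of retries: re-raise so a real problem still surfaces.
                raise
            sleep(pause)
            pause *= 2
            attempt += 1
```

Only timeouts are retried here; any other exception propagates immediately, which matches the
stated goal that real problems should not be swallowed.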

The patch applies cleanly to trunk and 0.11.x, so please review!!! I think this would be a
very good patch to get into 0.11 so long as Noah hasn't built the artifacts yet.

> Replication tasks crash.
> ------------------------
>
>                 Key: COUCHDB-597
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-597
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.11
>            Reporter: Robert Newson
>         Attachments: couchdb_597.patch
>
>
> If I kick off 10 replication tasks in quick succession, occasionally one or two of the
> replication tasks will die and not be resumed. It seems that the stat tracking is a little
> buggy, and under stress can eventually cause a permanent failure of the supervised replication
> task;
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [<0.80.0>] {error_report,<0.30.0>,
>     {<0.80.0>,supervisor_report,
>      [{supervisor,{local,couch_rep_sup}},
>       {errorContext,shutdown_error},
>       {reason,killed},
>       {offender,
>           [{pid,<0.6700.11>},
>            {name,"fcbb13200a1618cf983b347f4d2c9835+create_target"},
>            {mfa,
>                {gen_server,start_link,
>                    [couch_rep,
>                     ["fcbb13200a1618cf983b347f4d2c9835",
>                      {[{<<"create_target">>,true},
>                        {<<"source">>,<<"http://node:5984/perf-p2">>},
>                        {<<"target">>,<<"perf-p2">>}]},
>                      {user_ctx,null,[<<"_admin">>]}],
>                     []]}},
>            {restart_type,temporary},
>            {shutdown,1},
>            {child_type,worker}]}]}}
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [emulator] Error in process <0.6705.11> with exit value:
> {badarg,[{ets,insert,[stats_hit_table,{{couchdb,open_os_files},-1}]},{couch_stats_collector,decrement,1}]}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

