couchdb-dev mailing list archives

From "Adam Kocoloski (JIRA)" <>
Subject [jira] Updated: (COUCHDB-597) Replication tasks crash.
Date Sun, 20 Dec 2009 03:00:18 GMT


Adam Kocoloski updated COUCHDB-597:

Hi Robert, I can reproduce the crashes locally and I've discovered why they happen independently
of the {ref(), integer()} problem.  The basic issue is that attachment downloads do not employ
the same retry checks that we do for regular document GETs.  For instance, the attachment
receiver process associated with a replication would wait indefinitely for response
headers, when in fact it had an error message in its mailbox informing it that the request
had failed.  Eventually the changes feed times out and the replication crashes.
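
The failure mode above can be modeled outside Erlang. What follows is a minimal Python sketch, not the couch_rep code: a queue.Queue stands in for the process mailbox, and the message tags "headers" and "error", the RequestFailed exception, and the function wait_for_headers are all hypothetical names chosen for illustration. It shows a receiver that reacts to an error message, and skips unrelated messages, instead of waiting only for headers.

```python
import queue


class RequestFailed(Exception):
    """Raised when the mailbox holds an error for the pending request."""


def wait_for_headers(mailbox, timeout_s):
    """Wait for response headers, but also react to an error message.

    Messages are (tag, payload) tuples. The tags are hypothetical
    stand-ins for whatever the HTTP client delivers to the receiver.
    The timeout applies to each individual wait on the mailbox.
    """
    while True:
        try:
            tag, payload = mailbox.get(timeout=timeout_s)
        except queue.Empty:
            raise TimeoutError("no response headers within timeout")
        if tag == "headers":
            return payload
        if tag == "error":
            # The case the stuck receiver missed: the request already failed.
            raise RequestFailed(payload)
        # Catch-all: drop unrelated messages and keep waiting.
```

With this shape, a caller can catch RequestFailed and decide whether to retry, rather than sitting idle until the changes feed timeout fires.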

If I apply, crank up the changes feed timeout,
and add the catchall handle_infos we've talked about before I can successfully run the script
you posted here.  We have more work to do, though, namely:

1) Reworking the changes feed timeout.  Currently it will trigger if there is no activity
for X milliseconds on the connection handling the _changes feed.  There are situations where
this is actually normal, since the changes feed consumer is responsible for controlling the
socket, and if the target is _really_ slow (or the documents are huge) it's quite possible
that the changes feed will not be consulted for a long time.  I think the solution is to handle
inactivity timeouts in couch_rep_changes_feed.erl instead of in the underlying ibrowse system.

2a) Attachment retry logic that handles redirects and limits the number of retries.  Basically,
the same code as we have in couch_rep_httpc, but only applied until we receive the response
headers.  My friendpaste above is a primitive form of what I'd ultimately like to see here.

2b) When an attachment body download has started and then fails, we can't simply retry it.
We need to do a Range request or find another way to skip the first N bytes of the retry.
Currently we just give up on the entire replication if an attachment request ever fails mid-download.
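
Items 2a and 2b can be sketched together. The Python below is an illustrative model under stated assumptions, not the couch_rep_httpc logic: open_request(url, offset) is a hypothetical callable that returns (status, headers, body_iter), is expected to send "Range: bytes=<offset>-" when offset is greater than zero, and raises ConnectionError either before headers arrive or while the body is being read. The names download_attachment, AttachmentError, max_retries and max_redirects are likewise invented for the example, and a single retry budget is shared between the pre-header and mid-download cases for brevity.

```python
class AttachmentError(Exception):
    """Raised when the download cannot be completed within the limits."""


def download_attachment(open_request, url, max_retries=3, max_redirects=5):
    """Fetch an attachment, following redirects and resuming after failures.

    open_request(url, offset) -> (status, headers, body_iter) is a
    hypothetical transport hook; see the assumptions stated above.
    """
    received = bytearray()
    retries = 0
    redirects = 0
    while True:
        try:
            status, headers, body = open_request(url, len(received))
            if status in (301, 302, 303, 307):
                redirects += 1
                if redirects > max_redirects:
                    raise AttachmentError("too many redirects")
                url = headers["Location"]
                continue
            if status == 200:
                # Full body from byte 0 (Range absent or ignored):
                # discard any partial data collected so far.
                received.clear()
            elif status != 206:
                raise AttachmentError("unexpected status %d" % status)
            # On 206 we assume the server honored the requested offset.
            for chunk in body:
                received.extend(chunk)
            return bytes(received)
        except ConnectionError:
            # Covers failures before headers and mid-download alike.
            retries += 1
            if retries > max_retries:
                raise AttachmentError("retry limit reached")
```

The point of tracking len(received) is that a mid-download failure costs only the remaining bytes on the next attempt, provided the source answers the Range request with 206.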

> Replication tasks crash.
> ------------------------
>                 Key: COUCHDB-597
>                 URL:
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.11
>            Reporter: Robert Newson
> If I kick off 10 replication tasks in quick succession, occasionally one or two of the
> replication tasks will die and not be resumed. It seems that the stat tracking is a little
> buggy, and under stress can eventually cause a permanent failure of the supervised replication
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [<0.80.0>] {error_report,<0.30.0>,
>     {<0.80.0>,supervisor_report,
>      [{supervisor,{local,couch_rep_sup}},
>       {errorContext,shutdown_error},
>       {reason,killed},
>       {offender,
>           [{pid,<0.6700.11>},
>            {name,"fcbb13200a1618cf983b347f4d2c9835+create_target"},
>            {mfa,
>                {gen_server,start_link,
>                    [couch_rep,
>                     ["fcbb13200a1618cf983b347f4d2c9835",
>                      {[{<<"create_target">>,true},
>                        {<<"source">>,<<"http://node:5984/perf-p2">>},
>                        {<<"target">>,<<"perf-p2">>}]},
>                      {user_ctx,null,[<<"_admin">>]}],
>                     []]}},
>            {restart_type,temporary},
>            {shutdown,1},
>            {child_type,worker}]}]}}
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [emulator] Error in process <0.6705.11>
> with exit value: {badarg,[{ets,insert,[stats_hit_table,{{couchdb,open_os_files},-1}]},{couch_stats_collector,decrement,1}]}

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.