couchdb-dev mailing list archives

From "Randall Leeds (JIRA)" <j...@apache.org>
Subject [jira] Commented: (COUCHDB-597) Replication tasks crash.
Date Fri, 26 Mar 2010 20:24:27 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850335#action_12850335 ]

Randall Leeds commented on COUCHDB-597:
---------------------------------------

Re-opening. Still happening on 0.11, but for different reasons.

Germain reports this from his log on the user@ list:

[Fri, 26 Mar 2010 09:55:01 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc post request in 16.0 seconds due to {error, req_timedout}
[Fri, 26 Mar 2010 09:56:13 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc post request in 32.0 seconds due to {error, req_timedout}
[Fri, 26 Mar 2010 09:57:42 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc post request in 64.0 seconds due to {error, req_timedout}
[Fri, 26 Mar 2010 09:59:40 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc post request in 128.0 seconds due to {error, req_timedout}
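
The retry delays in the log above double on each attempt (16, 32, 64, 128 seconds), which is the usual exponential backoff pattern. A minimal Python sketch of that pattern follows. It is not the couch_rep_httpc code: the names backoff_delays and retry_with_backoff, the starting delay, and the absence of a cap are all assumptions made for illustration.

```python
import time
from itertools import islice


def backoff_delays(first=16.0, factor=2.0):
    """Yield an endless sequence of delays, each `factor` times the last.

    `first` and `factor` are illustrative defaults chosen to match the
    logged values; the real replicator's settings are not shown in the log.
    """
    delay = first
    while True:
        yield delay
        delay *= factor


def retry_with_backoff(request, delays, sleep=time.sleep):
    """Call `request` until it succeeds or the delays run out.

    After each TimeoutError, wait for the next delay and try again.
    Once the delays are exhausted, make one final attempt and let any
    exception from it propagate to the caller.
    """
    for delay in delays:
        try:
            return request()
        except TimeoutError:
            sleep(delay)
    return request()


# The first four delays match the four log lines above.
first_four = list(islice(backoff_delays(), 4))
```

Passing a stub as the sleep argument lets the retry loop be exercised without actually waiting.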

In my experience with this in production, I've seen PUT requests stall here while writing
the checkpoint document. I'm guessing the log above is the _ensure_full_commit failing (since
I think that's the only POST in replication). In my logs I see 409 conflicts when writing the
remote checkpoint document, but only timeouts on the receiving side of those conflicting PUTs.
I'm not sure why the conflict doesn't bubble up to couch_rep. My first guess is that in some
code path we're not asking ibrowse to stream the next chunk, so the remote side has sent a
response that we never retrieve.

> Replication tasks crash.
> ------------------------
>
>                 Key: COUCHDB-597
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-597
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.11
>            Reporter: Robert Newson
>             Fix For: 0.11
>
>         Attachments: 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch, 0001-Cleanup-597-fixes.patch, 597_fixes.patch, couchdb_597.patch
>
>
> If I kick off 10 replication tasks in quick succession, occasionally one or two of the
> replication tasks will die and not be resumed. It seems that the stat tracking is a little
> buggy, and under stress can eventually cause a permanent failure of the supervised replication
> task:
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [<0.80.0>] {error_report,<0.30.0>,
>     {<0.80.0>,supervisor_report,
>      [{supervisor,{local,couch_rep_sup}},
>       {errorContext,shutdown_error},
>       {reason,killed},
>       {offender,
>           [{pid,<0.6700.11>},
>            {name,"fcbb13200a1618cf983b347f4d2c9835+create_target"},
>            {mfa,
>                {gen_server,start_link,
>                    [couch_rep,
>                     ["fcbb13200a1618cf983b347f4d2c9835",
>                      {[{<<"create_target">>,true},
>                        {<<"source">>,<<"http://node:5984/perf-p2">>},
>                        {<<"target">>,<<"perf-p2">>}]},
>                      {user_ctx,null,[<<"_admin">>]}],
>                     []]}},
>            {restart_type,temporary},
>            {shutdown,1},
>            {child_type,worker}]}]}}
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [emulator] Error in process <0.6705.11> with exit value: {badarg,[{ets,insert,[stats_hit_table,{{couchdb,open_os_files},-1}]},{couch_stats_collector,decrement,1}]}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

