incubator-couchdb-user mailing list archives

From Chris Anderson <jch...@apache.org>
Subject Re: CouchDB pegging the CPU and not responding to requests
Date Tue, 01 Sep 2009 17:49:17 GMT
On Tue, Sep 1, 2009 at 7:52 AM, John Wood<john@interactivemediums.com> wrote:
> Hi everybody,
>
> I'm currently facing an issue with our production installation of CouchDB.
> Two times within the past 5 days, the Erlang process running CouchDB pegs
> one of the 4 cores on the machine, consumes about 40% of the system RAM
> (which is 4GB), and becomes completely unresponsive to incoming HTTP
> requests.  The only way we can get it back to normal is to restart CouchDB.
>
> I'm trying to determine what may be causing this, but I'm not having much
> luck.  Nothing stands out in the CouchDB log files.  I can see that there
> are no entries in the log files from the time it goes unresponsive until the
> time I restart it.  Besides that, there don't appear to be any errors
> leading up to the issue.  There are, however, a few errors like the one below,
> but none right before CouchDB goes unresponsive:
>
> [error] [<0.11738.288>] {error_report,<0.21.0>,
>    {<0.11738.288>,crash_report,
>     [[{pid,<0.11738.288>},
>       {registered_name,[]},
>       {error_info,
>           {error,
>               {case_clause,{error,enotconn}},
>               [{mochiweb_request,get,2},
>                {couch_httpd,handle_request,4},
>                {mochiweb_http,headers,5},
>                {proc_lib,init_p,5}]}},
>       {initial_call,
>           {mochiweb_socket_server,acceptor_loop,
>               [{<0.56.0>,#Port<0.148>,#Fun<mochiweb_http.1.81679042>}]}},
>       {ancestors,
>           [couch_httpd,couch_secondary_services,couch_server_sup,
>            <0.1.0>]},
>       {messages,[]},
>       {links,[<0.56.0>,#Port<0.5032425>]},
>       {dictionary,[{mochiweb_request_qs,[{"limit","0"}]}]},
>       {trap_exit,false},
>       {status,running},
>       {heap_size,28657},
>       {stack_size,23},
>       {reductions,14034}],
>      []]}}
> [error] [<0.56.0>] {error_report,<0.21.0>,
>    {<0.56.0>,std_error,
>     {mochiweb_socket_server,235,
>         {child_error,{case_clause,{error,enotconn}}}}}}
>
> =ERROR REPORT==== 30-Aug-2009::04:29:07 ===
> {mochiweb_socket_server,235,
>                        {child_error,{case_clause,{error,enotconn}}}}
>
> I checked some of the other system log files (/var/log/messages, etc), and
> there doesn't appear to be any information there either.
>
> Our CouchDB installation is fairly large.  We have 7 production databases,
> totaling almost 250GB.  The largest database is 129GB.  We are running
> CouchDB 0.9.0 on Red Hat Enterprise Server 5.3.  As far as usage goes, we
> are constantly inserting documents into the database (5,000 at a time via a
> bulk insert), and pausing to regenerate the views after 100,000 documents
> have been inserted.  Aside from the process that does the inserts, all
> views are accessed using stale=ok.
>
> Has anybody else faced a similar issue?  Can anybody suggest tips regarding
> how I should go about diagnosing this issue?
>
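
For reference, the insert-and-query pattern described above (bulk batches of
5,000, a limit=0 view read to force index regeneration, stale=ok for all other
readers) looks roughly like the sketch below. The _bulk_docs endpoint, the
limit=0 query, and stale=ok come from the report; the database, design
document, and view names, and the choice of HTTP client, are hypothetical.

    import json
    import requests  # any HTTP client will do; this is just a sketch

    COUCH = "http://localhost:5984"        # assumes the default CouchDB port
    DB = "messages"                        # hypothetical database name
    VIEW = "_design/stats/_view/by_day"    # hypothetical design doc and view

    def bulk_insert(docs):
        # POST a batch (e.g. 5,000 docs) to _bulk_docs in a single request.
        r = requests.post("%s/%s/_bulk_docs" % (COUCH, DB),
                          data=json.dumps({"docs": docs}),
                          headers={"Content-Type": "application/json"})
        r.raise_for_status()
        return r.json()

    def regenerate_views():
        # A limit=0 read without stale=ok makes the view index catch up
        # (the same limit=0 query string shows up in the crash report above).
        requests.get("%s/%s/%s" % (COUCH, DB, VIEW), params={"limit": 0})

    def read_view(**params):
        # All other readers pass stale=ok so they never trigger indexing.
        params["stale"] = "ok"
        return requests.get("%s/%s/%s" % (COUCH, DB, VIEW), params=params).json()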

Just a guess, based on the information available here, but the
enotconn error suggests that the remote client is dropping the
connection prematurely. There is an old bug about this in the tracker,
which might be worth reopening if we learn more about the issue (and it
is still present in trunk / 0.10):

http://issues.apache.org/jira/browse/COUCHDB-45

There is also this open bug which could be related:

https://issues.apache.org/jira/browse/COUCHDB-394

Perhaps you have clients that aren't properly closing their connections,
and then this is somehow running up against a limit in the underlying
server system (the maximum number of connections, or maybe even the
maximum number of Erlang processes in the VM).
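
One way to check that hypothesis is to watch how many sockets are sitting on
CouchDB's port and how many file descriptors the beam process is holding
relative to its ulimit. A minimal, Linux-only sketch (it assumes the default
port 5984; the Erlang process count and limit themselves would have to be read
from an attached Erlang shell via erlang:system_info(process_count) and
erlang:system_info(process_limit)):

    import glob

    COUCH_PORT_HEX = "%04X" % 5984   # assumes the default CouchDB port

    def count_couch_sockets():
        # Count TCP sockets whose local address is the CouchDB port
        # (this includes the listening socket itself).
        count = 0
        with open("/proc/net/tcp") as f:
            next(f)                            # skip the header line
            for line in f:
                local_addr = line.split()[1]   # e.g. "00000000:1760"
                if local_addr.endswith(":" + COUCH_PORT_HEX):
                    count += 1
        return count

    def count_open_fds(beam_pid):
        # Compare this against `ulimit -n` for the CouchDB (beam) process.
        return len(glob.glob("/proc/%d/fd/*" % beam_pid))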

It would be nice to get to the bottom of this one, eventually.

The first step I'd suggest is attempting to reproduce this on the
0.10.x branch from svn. This will at least tell us if the bug has been
fixed. If it's still around and repeatable, that will give us a test
case for finally crushing it into oblivion.
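
When you test against the rebuilt server, it's worth confirming that you are
actually talking to the 0.10.x instance and not the old 0.9.0 binary; the root
resource reports the running version. A small sketch (assuming the default
port):

    import requests

    def couch_version(url="http://localhost:5984"):
        # GET / returns something like {"couchdb": "Welcome", "version": "0.9.0"}.
        return requests.get(url + "/").json().get("version")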

It might help to know more about which client library you are using,
as this bug seems to depend on the TCP behavior of clients.

Chris

> Thanks,
> John
>
> --
> John Wood
> Interactive Mediums
> john@interactivemediums.com
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io
