Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
From: John Wood <john@interactivemediums.com>
Date: Tue, 1 Sep 2009 09:52:27 -0500
Message-ID: <a54e31e40909010752i23dd7f97s36cba1f96fc7651a@mail.gmail.com>
Subject: CouchDB pegging the CPU and not responding to requests
To: user@couchdb.apache.org


Hi everybody,

I'm currently facing an issue with our production installation of CouchDB.
Twice in the past five days, the Erlang process running CouchDB has pegged
one of the machine's 4 cores, consumed about 40% of the system's 4GB of RAM,
and become completely unresponsive to incoming HTTP requests.  The only way
we can get it back to normal is to restart CouchDB.
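One thing that might help pin down the timing is a small external probe that
records exactly when the server stops answering, so the hang can be lined up
against the CouchDB and system logs.  This is only a rough sketch: it assumes
CouchDB listens on the default 127.0.0.1:5984 and treats anything other than
an HTTP 200 within the timeout as "not responding".  `is_responsive` and
`watch` are hypothetical helpers, not part of CouchDB.

```python
import http.client
import time
import urllib.request

# Assumption: CouchDB is listening on its default address and port.
COUCH_URL = "http://127.0.0.1:5984/"


def is_responsive(url=COUCH_URL, timeout=5.0):
    """Return True if GET <url> answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError/HTTPError, refused connections and socket timeouts are all
        # OSError subclasses; HTTPException covers malformed responses.
        return False


def watch(check=is_responsive, interval=30.0, max_checks=None,
          log=print, sleep=time.sleep):
    """Poll `check` and log each up/down transition with a UTC timestamp.

    Runs forever when max_checks is None; returns the last observed state.
    """
    last = None
    n = 0
    while max_checks is None or n < max_checks:
        ok = check()
        if ok != last:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
            log(f"{stamp} UTC CouchDB {'responding' if ok else 'NOT responding'}")
            last = ok
        n += 1
        if max_checks is None or n < max_checks:
            sleep(interval)
    return last
```

Left running in the background (for example `watch(interval=30)`), it would
give a timestamp for the moment CouchDB stops answering, which could then be
compared with whatever else was happening on the box at that time.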

I'm trying to determine what may be causing this, but I'm not having much
luck.  Nothing stands out in the CouchDB log files: there are no entries at
all from the time it goes unresponsive until the time I restart it, and there
don't appear to be any errors leading up to the issue.  There are, however, a
few errors like the one below, although none occur right before CouchDB goes
unresponsive:

[error] [<0.11738.288>] {error_report,<0.21.0>,
    {<0.11738.288>,crash_report,
     [[{pid,<0.11738.288>},
       {registered_name,[]},
       {error_info,
           {error,
               {case_clause,{error,enotconn}},
               [{mochiweb_request,get,2},
                {couch_httpd,handle_request,4},
                {mochiweb_http,headers,5},
                {proc_lib,init_p,5}]}},
       {initial_call,
           {mochiweb_socket_server,acceptor_loop,
               [{<0.56.0>,#Port<0.148>,#Fun<mochiweb_http.1.81679042>}]}},
       {ancestors,
           [couch_httpd,couch_secondary_services,couch_server_sup,
            <0.1.0>]},
       {messages,[]},
       {links,[<0.56.0>,#Port<0.5032425>]},
       {dictionary,[{mochiweb_request_qs,[{"limit","0"}]}]},
       {trap_exit,false},
       {status,running},
       {heap_size,28657},
       {stack_size,23},
       {reductions,14034}],
      []]}}
[error] [<0.56.0>] {error_report,<0.21.0>,
    {<0.56.0>,std_error,
     {mochiweb_socket_server,235,
         {child_error,{case_clause,{error,enotconn}}}}}}

=ERROR REPORT==== 30-Aug-2009::04:29:07 ===
{mochiweb_socket_server,235,
                        {child_error,{case_clause,{error,enotconn}}}}

I checked some of the other system log files (/var/log/messages, etc.), and
there doesn't appear to be any relevant information there either.

Our CouchDB installation is fairly large.  We have 7 production databases
totaling almost 250GB; the largest is 129GB.  We are running CouchDB 0.9.0
on Red Hat Enterprise Server 5.3.  As far as usage goes, we are constantly
inserting documents into the database via bulk inserts of 5,000 documents
at a time, and we pause to regenerate the views after every 100,000
documents inserted.  Apart from the process that does the inserts, all view
queries use stale=ok.
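To make the load pattern concrete, here is a rough sketch of the request
sequence described above.  It only builds the list of requests rather than
sending them, and the database, design document and view names ("events",
"stats", "by_date") are hypothetical.  It assumes the 0.9-style view URL
/db/_design/ddoc/_view/view, that a view query without stale=ok brings the
index up to date (limit=0 so no rows come back), and that bulk inserts go to
POST /db/_bulk_docs with a {"docs": [...]} body.

```python
import json
import urllib.parse

BATCH_SIZE = 5_000        # documents per _bulk_docs request
REFRESH_EVERY = 100_000   # regenerate views after this many inserted documents


def view_path(db, ddoc, view, stale_ok=False, **params):
    """Build a view query path; stale_ok=True adds stale=ok (no index update)."""
    query = dict(params)
    if stale_ok:
        query["stale"] = "ok"
    path = f"/{db}/_design/{ddoc}/_view/{view}"
    return path + ("?" + urllib.parse.urlencode(query) if query else "")


def batches(docs, size=BATCH_SIZE):
    """Split a list of documents into consecutive chunks of at most `size`."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def plan_requests(docs, db, views, size=BATCH_SIZE, refresh_every=REFRESH_EVERY):
    """Return (method, path, body) tuples: bulk inserts plus periodic view refreshes.

    `views` is a list of (design_doc, view_name) pairs to refresh.
    """
    requests = []
    pending = 0
    for chunk in batches(docs, size):
        requests.append(("POST", f"/{db}/_bulk_docs", json.dumps({"docs": chunk})))
        pending += len(chunk)
        if pending >= refresh_every:
            # A non-stale query makes CouchDB update the index before answering.
            for ddoc, view in views:
                requests.append(("GET", view_path(db, ddoc, view, limit=0), None))
            pending = 0
    return requests
```

Readers, by contrast, would hit something like
view_path("events", "stats", "by_date", stale_ok=True), which returns
whatever is already in the index without triggering an update.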

Has anybody else faced a similar issue?  Can anybody suggest how I should go
about diagnosing it?

Thanks,
John

-- 
John Wood
Interactive Mediums
john@interactivemediums.com
