Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
MIME-Version: 1.0
Sender: jchris@gmail.com
In-Reply-To: <a54e31e40909010752i23dd7f97s36cba1f96fc7651a@mail.gmail.com>
References: <a54e31e40909010752i23dd7f97s36cba1f96fc7651a@mail.gmail.com>
Date: Tue, 1 Sep 2009 10:49:17 -0700
Message-ID: <e282921e0909011049u6dcedb8ehf132a308e73e5d4c@mail.gmail.com>
Subject: Re: CouchDB pegging the CPU and not responding to requests
From: Chris Anderson <jchris@apache.org>
To: user@couchdb.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

On Tue, Sep 1, 2009 at 7:52 AM, John Wood <john@interactivemediums.com> wrote:
> Hi everybody,
>
> I'm currently facing an issue with our production installation of CouchDB.
> Twice within the past 5 days, the Erlang process running CouchDB has pegged
> one of the 4 cores on the machine, consumed about 40% of the system RAM
> (which is 4GB), and become completely unresponsive to incoming HTTP
> requests. The only way we can get it back to normal is to restart CouchDB.
>
> I'm trying to determine what may be causing this, but I'm not having much
> luck. Nothing stands out in the CouchDB log files. I can see that there
> are no entries in the log files from the time it goes unresponsive until the
> time I restart it. Besides that, there don't appear to be any errors
> leading up to the issue. There are, however, a few errors like the one below,
> but none right before CouchDB goes unresponsive:
>
> [error] [<0.11738.288>] {error_report,<0.21.0>,
>    {<0.11738.288>,crash_report,
>     [[{pid,<0.11738.288>},
>       {registered_name,[]},
>       {error_info,
>           {error,
>               {case_clause,{error,enotconn}},
>               [{mochiweb_request,get,2},
>                {couch_httpd,handle_request,4},
>                {mochiweb_http,headers,5},
>                {proc_lib,init_p,5}]}},
>       {initial_call,
>           {mochiweb_socket_server,acceptor_loop,
>               [{<0.56.0>,#Port<0.148>,#Fun<mochiweb_http.1.81679042>}]}},
>       {ancestors,
>           [couch_httpd,couch_secondary_services,couch_server_sup,
>            <0.1.0>]},
>       {messages,[]},
>       {links,[<0.56.0>,#Port<0.5032425>]},
>       {dictionary,[{mochiweb_request_qs,[{"limit","0"}]}]},
>       {trap_exit,false},
>       {status,running},
>       {heap_size,28657},
>       {stack_size,23},
>       {reductions,14034}],
>      []]}}
> [error] [<0.56.0>] {error_report,<0.21.0>,
>    {<0.56.0>,std_error,
>     {mochiweb_socket_server,235,
>         {child_error,{case_clause,{error,enotconn}}}}}}
>
> =ERROR REPORT==== 30-Aug-2009::04:29:07 ===
> {mochiweb_socket_server,235,
>                        {child_error,{case_clause,{error,enotconn}}}}
>
> I checked some of the other system log files (/var/log/messages, etc.), and
> there doesn't appear to be any information there either.
>
> Our CouchDB installation is fairly large. We have 7 production databases,
> totaling almost 250GB. The largest database is 129GB. We are running
> CouchDB 0.9.0 on Red Hat Enterprise Server 5.3. As far as usage goes, we
> are constantly inserting documents into the database (5,000 at a time via a
> bulk insert), and pausing to regenerate the views after 100,000 documents
> have been inserted. Apart from the process that does the inserts, all
> views are accessed using stale=ok.
>
> Has anybody else faced a similar issue? Can anybody suggest tips regarding
> how I should go about diagnosing this issue?
>

Just a guess, based on the information available here, but the
enotconn error suggests that the remote client is dropping the
connection prematurely. There is an old bug about this in the tracker,
which might be a good thing to reopen if we learn much more about the
issue (and it is still present in trunk / 0.10):

http://issues.apache.org/jira/browse/COUCHDB-45

There is also this open bug which could be related:

https://issues.apache.org/jira/browse/COUCHDB-394

Perhaps you have clients that aren't properly closing the connection,
and then somehow this is running up against a limit in the underlying
server system (max number of connections, or maybe even max number of
Erlang processes in the VM).
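
One rough way to check the connection-limit theory is to count the
sockets on the CouchDB port by TCP state the next time the server hangs.
The sketch below is only an illustration, not part of CouchDB: the
function name count_tcp_states is made up, it assumes a Linux host with
CouchDB on its default port 5984, and it reads the IPv4 table only
(/proc/net/tcp6 would need the same treatment). A large and growing
CLOSE_WAIT count would be consistent with clients going away while the
server still holds their sockets, though it would not prove it.

```python
from collections import Counter

# Subset of the TCP state codes that Linux prints in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_tcp_states(lines, port=5984):
    """Count sockets whose local port is `port`, grouped by TCP state.

    `lines` is any iterable of lines in /proc/net/tcp format.
    The default port is an assumption (CouchDB's default).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and anything malformed.
        if len(fields) < 4 or ":" not in fields[1]:
            continue
        # local_address looks like "0100007F:1760" (hex address, hex port).
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port:
            state = fields[3].upper()
            counts[TCP_STATES.get(state, state)] += 1
    return counts
```

On the server this could be run as
count_tcp_states(open("/proc/net/tcp")) every minute or so, logging the
result, so that there is a record leading up to the next hang.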

It would be nice to get to the bottom of this one, eventually.

The first step I'd suggest taking is attempting to reproduce on the
0.10.x branch from svn. This will at least tell us if the bug has been
fixed. If it's still around and repeatable, that will give us a test
case for finally crushing it into oblivion.
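
For the reproduction attempt, something like the following might be a
starting point. It is a guess at the client behavior, not a known
trigger: it opens a connection, sends a GET, and then resets the
connection without reading the response, which is one way a client can
"drop the connection prematurely". The function name abrupt_requests,
the path "/", and the default host and port are all assumptions; point
it at a test instance, never at production.

```python
import socket
import struct


def abrupt_requests(host="127.0.0.1", port=5984, count=100, path="/"):
    """Send `count` GET requests, resetting each connection without reading.

    Returns the number of connections on which the request was sent.
    """
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")
    sent = 0
    for _ in range(count):
        try:
            sock = socket.create_connection((host, port), timeout=5)
        except OSError:
            continue  # server not accepting; count only successful sends
        try:
            # SO_LINGER enabled with a zero timeout makes close() send a
            # RST instead of a normal FIN (any unsent data is discarded).
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                            struct.pack("ii", 1, 0))
            sock.sendall(request)
            sent += 1
        except OSError:
            pass
        finally:
            sock.close()
    return sent
```

If the enotconn crash reports show up in the 0.10.x log while this runs,
that would at least confirm the client-disconnect theory for that error;
whether it also leads to the CPU pegging is a separate question.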

It might help to know more about which client library you are using,
as this bug seems to depend on the TCP behavior of clients.

Chris

> Thanks,
> John
>
> --
> John Wood
> Interactive Mediums
> john@interactivemediums.com
>


-- 
Chris Anderson
http://jchrisa.net
http://couch.io