From: Chris Anderson
To: user@couchdb.apache.org
Date: Tue, 1 Sep 2009 10:49:17 -0700
Subject: Re: CouchDB pegging the CPU and not responding to requests

On Tue, Sep 1, 2009 at 7:52 AM, John Wood wrote:
> Hi everybody,
>
> I'm currently facing an issue with our production installation of CouchDB.
> Two times within the past 5 days, the Erlang process running CouchDB pegs
> one of the 4 cores on the machine, consumes about 40% of the system RAM
> (which is 4GB), and becomes completely unresponsive to incoming HTTP
> requests.  The only way we can get it back to normal is to restart CouchDB.
>
> I'm trying to determine what may be causing this, but I'm not having much
> luck.  Nothing stands out in the CouchDB log files.  I can see that there
> are no entries in the log files from the time it goes unresponsive until the
> time I restart it.  Besides that, there doesn't appear to be any errors
> leading up to the issue.
> There are however a few errors like the one below,
> but none right before CouchDB goes unresponsive:
>
> [error] [<0.11738.288>] {error_report,<0.21.0>,
>    {<0.11738.288>,crash_report,
>     [[{pid,<0.11738.288>},
>       {registered_name,[]},
>       {error_info,
>           {error,
>               {case_clause,{error,enotconn}},
>               [{mochiweb_request,get,2},
>                {couch_httpd,handle_request,4},
>                {mochiweb_http,headers,5},
>                {proc_lib,init_p,5}]}},
>       {initial_call,
>           {mochiweb_socket_server,acceptor_loop,
>               [{<0.56.0>,#Port<0.148>,#Fun}]}},
>       {ancestors,
>           [couch_httpd,couch_secondary_services,couch_server_sup,
>            <0.1.0>]},
>       {messages,[]},
>       {links,[<0.56.0>,#Port<0.5032425>]},
>       {dictionary,[{mochiweb_request_qs,[{"limit","0"}]}]},
>       {trap_exit,false},
>       {status,running},
>       {heap_size,28657},
>       {stack_size,23},
>       {reductions,14034}],
>      []]}}
> [error] [<0.56.0>] {error_report,<0.21.0>,
>    {<0.56.0>,std_error,
>     {mochiweb_socket_server,235,
>         {child_error,{case_clause,{error,enotconn}}}}}}
>
> =ERROR REPORT==== 30-Aug-2009::04:29:07 ===
> {mochiweb_socket_server,235,
>                        {child_error,{case_clause,{error,enotconn}}}}
>
> I checked some of the other system log files (/var/log/messages, etc), and
> there doesn't appear to be any information there either.
>
> Our CouchDB installation is fairly large.  We have 7 production databases,
> totaling almost 250GB.  The largest database is 129GB.  We are running
> CouchDB 0.9.0 on Red Hat Enterprise Server 5.3.
> As far as usage goes, we
> are constantly inserting documents into the database (5,000 at a time via a
> bulk insert), and pausing to regenerate the views after 100,000 documents
> have been inserted.  Besides for the process that does the inserts, all
> views are accessed using stale=ok.
>
> Has anybody else faced a similar issue?  Can anybody suggest tips regarding
> how I should go about diagnosing this issue?
>

Just a guess, based on the information available here, but the
enotconn error suggests that the remote client is dropping the
connection prematurely.

There is an old bug about this in the tracker, which might be a good
thing to reopen if we learn much more about the issue (and it is still
present in trunk / 0.10):

http://issues.apache.org/jira/browse/COUCHDB-45

There is also this open bug which could be related:

https://issues.apache.org/jira/browse/COUCHDB-394

Perhaps you have clients that aren't properly closing the connection,
and then somehow this is running up against a limit in the underlying
server system (max number of connections, or maybe even max number of
Erlang processes in the VM).

It would be nice to get to the bottom of this one, eventually. The
first step I'd suggest taking is attempting to reproduce on the 0.10.x
branch from svn. This will at least tell us if the bug has been fixed.
If it's still around and repeatable, that will give us a test case for
finally crushing it into oblivion.

It might help to know more about which client library you are using,
as this bug seems to depend on the TCP behavior of clients.

Chris

> Thanks,
> John
>
> --
> John Wood
> Interactive Mediums
> john@interactivemediums.com
>

-- 
Chris Anderson
http://jchrisa.net
http://couch.io
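
For anyone trying to reproduce this, here is a rough sketch of the load
pattern described above (bulk inserts of 5,000 documents, a view refresh
every 100,000) written so that every request drains its response and
closes its socket. It is only an illustration, not the client John is
running: the helper names (couch_conn, batches, request, load) and the
view path in the usage note are made up for the example, and it assumes
CouchDB on its default port 5984 with the standard _bulk_docs endpoint.

```python
import http.client
import json
from contextlib import closing


def couch_conn(host="localhost", port=5984):
    """Hypothetical helper: a fresh connection to an assumed local CouchDB."""
    return http.client.HTTPConnection(host, port, timeout=30)


def batches(docs, size=5000):
    """Yield successive lists of at most `size` documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def request(make_conn, method, path, body=None):
    """Send one request, read the full response, and always close the socket."""
    headers = {"Content-Type": "application/json"} if body is not None else {}
    payload = json.dumps(body) if body is not None else None
    with closing(make_conn()) as conn:
        conn.request(method, path, body=payload, headers=headers)
        resp = conn.getresponse()
        data = resp.read()  # drain the response before the socket is closed
        return resp.status, data


def load(make_conn, db, view_path, docs, batch_size=5000, refresh_every=100000):
    """Bulk-insert docs; query the view once per `refresh_every` inserted docs."""
    since_refresh = 0
    for batch in batches(docs, batch_size):
        request(make_conn, "POST", f"/{db}/_bulk_docs", {"docs": batch})
        since_refresh += len(batch)
        if since_refresh >= refresh_every:
            # limit=0 (as in the crash report's query string) returns no rows
            # but still makes the server bring the view index up to date.
            request(make_conn, "GET", f"/{db}/{view_path}?limit=0")
            since_refresh = 0
```

Something like load(couch_conn, "mydb", "_design/example/_view/by_id", docs)
would drive it, where the database and view names are placeholders. If the
hang still shows up with a client that closes connections this carefully,
that points away from the client-side TCP behavior.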