Return-Path: Delivered-To: apmail-couchdb-user-archive@www.apache.org Received: (qmail 87294 invoked from network); 25 Feb 2011 22:44:23 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 25 Feb 2011 22:44:23 -0000 Received: (qmail 61200 invoked by uid 500); 25 Feb 2011 22:44:21 -0000 Delivered-To: apmail-couchdb-user-archive@couchdb.apache.org Received: (qmail 61137 invoked by uid 500); 25 Feb 2011 22:44:20 -0000 Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@couchdb.apache.org Delivered-To: mailing list user@couchdb.apache.org Received: (qmail 61129 invoked by uid 99); 25 Feb 2011 22:44:20 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 25 Feb 2011 22:44:20 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of eronenp@gmail.com designates 74.125.82.54 as permitted sender) Received: from [74.125.82.54] (HELO mail-ww0-f54.google.com) (74.125.82.54) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 25 Feb 2011 22:44:15 +0000 Received: by wwi18 with SMTP id 18so2460054wwi.23 for ; Fri, 25 Feb 2011 14:43:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=jJ6UsEpUXunf25jc1WMl/b3wyXnG6JCYuX2usomVyIQ=; b=eFQeHrpYcVzbnqjFcoirksgeA7K56HCtaORhG7+ibh7RRa87QE07ISu7Cu5SA8XgR0 cIQgCP9pQQPtRobVPL1MrgCsgUJvBiLqXY5eIICxu5w1jPTDPhdw6Tj/PlIvp6bds/YR jhk6W2sGyTWD4qXlj0JKhrwblggzy8QXp+gDY= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=teID6/clUZ6eb088CX1DOGywcNKSXY2tUTUfFRJMhhM628Wj5duEJFvNv1ZFe/bzjx R1/RTs8JILnfIv1gqKZ2fUlFFnqtxGEmv1raIo0gxx8dpXM5VSvR8z/qzwA/s1P9LfLF 3eYWX7XsTe5bmqeiXes5foLSNs3re/w7M5mxc= MIME-Version: 1.0 Received: by 10.227.32.73 with SMTP id b9mr2613522wbd.201.1298673833947; Fri, 25 Feb 2011 14:43:53 -0800 (PST) Sender: eronenp@gmail.com Received: by 10.227.159.69 with HTTP; Fri, 25 Feb 2011 14:43:53 -0800 (PST) In-Reply-To: References: Date: Sat, 26 Feb 2011 00:43:53 +0200 X-Google-Sender-Auth: W4-oVTtQA-5Q7XywwMb4NfD5UDk Message-ID: Subject: Re: CouchDB 1.0.2 errors under load From: Pasi Eronen To: Paul Davis Cc: user@couchdb.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Thanks! I looked at /proc/(pid-of-beam.smp)/fd, and it seems CouchDB is indeed using a lot of file descriptors, and the number is growing. I restarted CouchDB (and my batch job), and after 40 minutes, it was already using 300 fds; an hour later, the figure was 600; half an hour later, almost 700. The default "ulimit -n" was 1024, so that adds up to couple of hours. As a temporary fix I changed the limit to 64K, so it should work fine for the weekend :-) (but not much longer) Based on "netstat -an", the fds are TCP connections stuck in CLOSE_WAIT state. Hmm. It seems CLOSE_WAIT means the client closed the connection but CouchDB didn't call close() to release the fd? And that's presumably a bug somewhere in CouchDB, right? After the initial error message, the log had ~300 almost identical error messages of the form (with different number after pread_iolist): [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.31510.2>] Uncaught error in HTTP request: {exit, {normal, {gen_server,call, [<0.288.0>, {pread_iolist,812881959}, infinity]}}} Does this give any hints where the bug might be? If not, I'll see if I can reproduce the bug somehow in more controlled setting. (I have a hunch that this might involve the use of HTTP expect/continue or POST requests with chunked transfer encoding, but let's see....) Best regards, Pasi On Fri, Feb 25, 2011 at 19:01, Paul Davis wro= te: > On Fri, Feb 25, 2011 at 4:18 AM, Pasi Eronen wrote: >> Hi, >> >> I had a big batch job (inserting 10M+ documents and generating views for= them) >> that ran just fine for about 6 hours, but then I got this error: >> >> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.276.0>] ** Generic server >> <0.276.0> terminating >> ** Last message in was delayed_commit >> ** When Server state =3D=3D {db,<0.275.0>,<0.276.0>,nil,<<"1298547642391= 489">>, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0<0.273.0>,<0.277.= 0>, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{db_header,5,7398= 28,0, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{47786130= 11,{663866,0}}, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{47786149= 54,663866}, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0nil,0,nil= ,nil,1000}, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0739828, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{btree,<0.273.0>, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{47787727= 55,{663866,0}}, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun}, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{btree,<0.273.0>, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{47787746= 98,663866}, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun}, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{btree,<0.273.0>,= nil, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun,nil}, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0739831,<<"foo_rep= lication_tmp">>, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0"/data/foo/couchd= b-data/foo_replication_tmp.couch", >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0[],[],nil, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{user_ctx,null,[]= ,undefined}, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Ref<0.0.1793.256= 453>,1000, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0[before_header,af= ter_header,on_file_open], >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0false} >> ** Reason for termination =3D=3D >> ** {{badmatch,{error,emfile}}, >> =A0 =A0[{couch_file,sync,1}, >> =A0 =A0 {couch_db_updater,commit_data,2}, >> =A0 =A0 {couch_db_updater,handle_info,2}, >> =A0 =A0 {gen_server,handle_msg,5}, >> =A0 =A0 {proc_lib,init_p_do_apply,3}]} >> >> (+lot of other messages with the same timestamp -- can send if they're u= seful) >> >> Exactly at this time, the client got HTTP 500 status code; the request >> was a bulk get (POST /foo_replication_tmp/_all_docs?include_docs=3Dtrue)= . >> >> Just before this request, the client had made a PUT (updating an existin= g >> document) that got 200 status code, but apparently was not successfully >> committed to the disk (I'm using "delayed_commits=3Dtrue" - for my app, >> this is just fine). The client had received the new _rev value, but when >> it tried updating the same document a minute later, there was a conflict >> (and it's not possible that somebody else updated this same document). >> >> About four hours later, there was a different error ("accept_failed" >> sounds like some temporary problem with sockets?): >> >> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.3= 1.0>, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0{<0.20693.4>,std_error, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 [{application,mochiweb}, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0"Accept failed error","{error,emfile}"]}} >> >> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.3= 1.0>, >> =A0 =A0{<0.20693.4>,crash_report, >> =A0 =A0 [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument= __1']}}, >> =A0 =A0 =A0 {pid,<0.20693.4>}, >> =A0 =A0 =A0 {registered_name,[]}, >> =A0 =A0 =A0 {error_info, >> =A0 =A0 =A0 =A0 =A0 {exit, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 {error,accept_failed}, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 [{mochiweb_socket_server,acceptor_loop,1}, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{proc_lib,init_p_do_apply,3}]}}, >> =A0 =A0 =A0 {ancestors, >> =A0 =A0 =A0 =A0 =A0 [couch_httpd,couch_secondary_services,couch_server_s= up,<0.32.0>]}, >> =A0 =A0 =A0 {messages,[]}, >> =A0 =A0 =A0 {links,[<0.106.0>]}, >> =A0 =A0 =A0 {dictionary,[]}, >> =A0 =A0 =A0 {trap_exit,false}, >> =A0 =A0 =A0 {status,running}, >> =A0 =A0 =A0 {heap_size,233}, >> =A0 =A0 =A0 {stack_size,24}, >> =A0 =A0 =A0 {reductions,200}], >> =A0 =A0 =A0[]]}} >> >> (+lots of other messages within the next couple of minutes) >> >> The same error occured once more, about four hours later. >> >> I'm quite new to CouchDB, so I'd appreciate any help in interpreting >> what these error messages mean. (BTW, are these something I should >> report as bugs in JIRA? I can do that, but I'd like to at least understa= nd >> which parts of the error messages are actually relevant here :-) >> >> I'm running CouchDB 1.0.2 with Erlang R14B on 64-bit RHEL 5.6. >> >> Best regards, >> Pasi >> > > The error you're getting is because CouchDB is running out of > available file descriptors to use. Try increasing the limit for the > user running CouchDB. >