From user-return-15423-apmail-couchdb-user-archive=couchdb.apache.org@couchdb.apache.org Fri Mar 25 08:59:52 2011 Return-Path: Delivered-To: apmail-couchdb-user-archive@www.apache.org Received: (qmail 95880 invoked from network); 25 Mar 2011 08:59:51 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 25 Mar 2011 08:59:51 -0000 Received: (qmail 38394 invoked by uid 500); 25 Mar 2011 08:59:50 -0000 Delivered-To: apmail-couchdb-user-archive@couchdb.apache.org Received: (qmail 38328 invoked by uid 500); 25 Mar 2011 08:59:49 -0000 Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@couchdb.apache.org Delivered-To: mailing list user@couchdb.apache.org Received: (qmail 38319 invoked by uid 99); 25 Mar 2011 08:59:49 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 25 Mar 2011 08:59:49 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=FREEMAIL_FROM,HTML_MESSAGE,NORMAL_HTTP_TO_IP,RCVD_IN_DNSWL_LOW,SPF_PASS,WEIRD_PORT X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of eronenp@gmail.com designates 209.85.216.173 as permitted sender) Received: from [209.85.216.173] (HELO mail-qy0-f173.google.com) (209.85.216.173) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 25 Mar 2011 08:59:42 +0000 Received: by qyk36 with SMTP id 36so6217463qyk.11 for ; Fri, 25 Mar 2011 01:59:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=yCTvWRLMe55qvw1XIMBFxgux9fcLyBM/6v+9Bu7P1Po=; b=BPE8Pmx0ogDmswfw5KZxoX3bqv/u3EkZ8PSlrhNfplYithjkrv034f7c/YmNeImNjX FhK1eOul1f7+Ez7D4RhDJS6XgpvVgOMnJiIsVWc6eYOR9GhZEvXk4yYulk5iw4e9UHAK 8RhgrCOUmK3wo1IEtgSnE9GHKih4GBAMoVqrc= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; b=tFnneGljD33mK+AMhuptSPqNqGDzr+Ot+Lx10fYfhiS7fI3/dfVfGIFul5E9HBPrBz sOI/71WxUP8PwzjoubgcJ6SZCB+kBHVoNdcHTHsfeUNN+Dz/t2G/zCOuepf5RynTtIZ1 nbveOMtMNtoozB16rh6R1tjM9u7y52DpSnM4k= MIME-Version: 1.0 Received: by 10.224.104.80 with SMTP id n16mr379231qao.361.1301043560908; Fri, 25 Mar 2011 01:59:20 -0700 (PDT) Sender: eronenp@gmail.com Received: by 10.224.74.20 with HTTP; Fri, 25 Mar 2011 01:59:20 -0700 (PDT) In-Reply-To: References: <6B1EBDF9-1954-4B89-BFBE-74AA98008A1F@gmail.com> Date: Fri, 25 Mar 2011 10:59:20 +0200 X-Google-Sender-Auth: G9uvXmOm6oHNhW5EH0yl0jDJwGY Message-ID: Subject: Re: CouchDB 1.0.2 errors under load From: Pasi Eronen To: "Paul J. Davis" Cc: "user@couchdb.apache.org" Content-Type: multipart/alternative; boundary=bcaec5431d2a4905ac049f4acfc8 X-Virus-Checked: Checked by ClamAV on apache.org --bcaec5431d2a4905ac049f4acfc8 Content-Type: text/plain; charset=ISO-8859-1 I've finally managed to reproduce this problem in a test environment, and filed a bug report: https://issues.apache.org/jira/browse/COUCHDB-1100 Best regards, Pasi On Sat, Feb 26, 2011 at 11:19, Pasi Eronen wrote: > BTW, it seems just changing "ulimit -n" did not work. At some point, > the server just stopped responding completely (even the root URL just > times out, and there's nothing in the error log). > > A quick googling found references to enabling kernel poll (at build time > and on erl command line), and increasing ERL_MAX_PORTS > environment variable. Based on a quick look it seems kernel poll > was already enabled, but I'll experiment with ERL_MAX_PORTS... > > Best regards, > Pasi > > > > On Sat, Feb 26, 2011 at 03:40, Paul J. Davis wrote: > >> >> >> On Feb 25, 2011, at 7:31 PM, Pasi Eronen wrote: >> >> Yes, the client is doing something funky :-) Many of the requests >> it sends are POSTs with chunked transfer encoding, so it's being >> hit hard by issue COUCHDB-682. In other words, there's lot of closing >> and opening of TCP connections going on, and retrying requests that >> failed due to COUCHDB-682. >> >> (The server behavior in COUCHDB-682 is probably strictly speaking not >> violating the HTTP spec -- the server is allowed to close an idle >> persistent >> connection at any time -- but it's certainly causing unusual behavior >> vs. opening and closing of TCP connections.) >> >> Output from netstat -anp looks like this: >> >> tcp 1 0 127.0.0.1:5984 127.0.0.1:51681 CLOSE_WAIT 19124/beam.smp >> tcp 1 0 127.0.0.1:5984 127.0.0.1:36576 CLOSE_WAIT 19124/beam.smp >> tcp 1 0 127.0.0.1:5984 127.0.0.1:35311 CLOSE_WAIT 19124/beam.smp >> (...) >> >> The other directions (e.g. 127.0.0.1:51681 -> 127.0.0.1:5984) are no >> longer >> around (if they would be, I guess the client would also run out of fds -- >> and >> the number of fds used by the client is not growing). >> >> Does this offer any clues for diagnosing the bug? >> >> >> Its definitely pointing the finger at CouchDB/erlang. >> >> I can also try if I can reproduce the fd leak with some simple test >> program... >> >> >> Please do. I'd start by trying to mimic your scripts main request thing to >> see if you can trigger individual sockets to end up in close_wait. >> >> Best regards, >> Pasi >> >> On Sat, Feb 26, 2011 at 01:11, Paul Davis < >> paul.joseph.davis@gmail.com> wrote: >> > It sounds like your client is doing something funky with shutting down >> > connections to the server. If I were to guess I would say that it >> > looks like you're leaking sockets between a call to shutdown and a >> > call to close. Though to say for certain I'd need to see what state >> > both ends of the socket are in. >> > >> > The errors about pread_iolist are hard to diagnose without the >> > surrounding error messages but if I were to guess is that you're >> > getting an emfile that cascades to the file gen_server which then >> > notifies other people trying to issue read calls. >> > >> > On Fri, Feb 25, 2011 at 5:43 PM, Pasi Eronen < pe@iki.fi> >> wrote: >> >> Thanks! I looked at /proc/(pid-of-beam.smp)/fd, and it seems CouchDB >> >> is indeed using a lot of file descriptors, and the number is growing. >> >> I restarted CouchDB (and my batch job), and after 40 minutes, it was >> >> already using 300 fds; an hour later, the figure was 600; half >> >> an hour later, almost 700. >> >> >> >> The default "ulimit -n" was 1024, so that adds up to couple of hours. >> >> As a temporary fix I changed the limit to 64K, so it should work fine >> >> for the weekend :-) (but not much longer) >> >> >> >> Based on "netstat -an", the fds are TCP connections stuck in CLOSE_WAIT >> >> state. Hmm. It seems CLOSE_WAIT means the client closed the connection >> >> but CouchDB didn't call close() to release the fd? And that's >> presumably >> >> a bug somewhere in CouchDB, right? >> >> >> >> After the initial error message, the log had ~300 almost identical >> >> error messages of the form (with different number after pread_iolist): >> >> >> >> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.31510.2>] >> >> Uncaught error in HTTP request: {exit, >> >> {normal, >> >> {gen_server,call, >> >> [<0.288.0>, >> >> {pread_iolist,812881959}, >> >> infinity]}}} >> >> >> >> Does this give any hints where the bug might be? >> >> >> >> If not, I'll see if I can reproduce the bug somehow in more controlled >> >> setting. (I have a hunch that this might involve the use of HTTP >> >> expect/continue or POST requests with chunked transfer encoding, >> >> but let's see....) >> >> >> >> Best regards, >> >> Pasi >> >> >> >> On Fri, Feb 25, 2011 at 19:01, Paul Davis < >> paul.joseph.davis@gmail.com> wrote: >> >>> On Fri, Feb 25, 2011 at 4:18 AM, Pasi Eronen < pe@iki.fi> >> wrote: >> >>>> Hi, >> >>>> >> >>>> I had a big batch job (inserting 10M+ documents and generating views >> for them) >> >>>> that ran just fine for about 6 hours, but then I got this error: >> >>>> >> >>>> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.276.0>] ** Generic server >> >>>> <0.276.0> terminating >> >>>> ** Last message in was delayed_commit >> >>>> ** When Server state == >> {db,<0.275.0>,<0.276.0>,nil,<<"1298547642391489">>, >> >>>> <0.273.0>,<0.277.0>, >> >>>> {db_header,5,739828,0, >> >>>> {4778613011,{663866,0}}, >> >>>> {4778614954,663866}, >> >>>> nil,0,nil,nil,1000}, >> >>>> 739828, >> >>>> {btree,<0.273.0>, >> >>>> {4778772755,{663866,0}}, >> >>>> #Fun, >> >>>> #Fun, >> >>>> #Fun, >> >>>> #Fun}, >> >>>> {btree,<0.273.0>, >> >>>> {4778774698,663866}, >> >>>> #Fun, >> >>>> #Fun, >> >>>> #Fun, >> >>>> #Fun}, >> >>>> {btree,<0.273.0>,nil, >> >>>> #Fun, >> >>>> #Fun, >> >>>> #Fun,nil}, >> >>>> 739831,<<"foo_replication_tmp">>, >> >>>> >> "/data/foo/couchdb-data/foo_replication_tmp.couch", >> >>>> [],[],nil, >> >>>> {user_ctx,null,[],undefined}, >> >>>> #Ref<0.0.1793.256453>,1000, >> >>>> [before_header,after_header,on_file_open], >> >>>> false} >> >>>> ** Reason for termination == >> >>>> ** {{badmatch,{error,emfile}}, >> >>>> [{couch_file,sync,1}, >> >>>> {couch_db_updater,commit_data,2}, >> >>>> {couch_db_updater,handle_info,2}, >> >>>> {gen_server,handle_msg,5}, >> >>>> {proc_lib,init_p_do_apply,3}]} >> >>>> >> >>>> (+lot of other messages with the same timestamp -- can send if >> they're useful) >> >>>> >> >>>> Exactly at this time, the client got HTTP 500 status code; the >> request >> >>>> was a bulk get (POST >> /foo_replication_tmp/_all_docs?include_docs=true). >> >>>> >> >>>> Just before this request, the client had made a PUT (updating an >> existing >> >>>> document) that got 200 status code, but apparently was not >> successfully >> >>>> committed to the disk (I'm using "delayed_commits=true" - for my app, >> >>>> this is just fine). The client had received the new _rev value, but >> when >> >>>> it tried updating the same document a minute later, there was a >> conflict >> >>>> (and it's not possible that somebody else updated this same >> document). >> >>>> >> >>>> About four hours later, there was a different error ("accept_failed" >> >>>> sounds like some temporary problem with sockets?): >> >>>> >> >>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] >> {error_report,<0.31.0>, >> >>>> {<0.20693.4>,std_error, >> >>>> [{application,mochiweb}, >> >>>> "Accept failed error","{error,emfile}"]}} >> >>>> >> >>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] >> {error_report,<0.31.0>, >> >>>> {<0.20693.4>,crash_report, >> >>>> >> [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}}, >> >>>> {pid,<0.20693.4>}, >> >>>> {registered_name,[]}, >> >>>> {error_info, >> >>>> {exit, >> >>>> {error,accept_failed}, >> >>>> [{mochiweb_socket_server,acceptor_loop,1}, >> >>>> {proc_lib,init_p_do_apply,3}]}}, >> >>>> {ancestors, >> >>>> >> [couch_httpd,couch_secondary_services,couch_server_sup,<0.32.0>]}, >> >>>> {messages,[]}, >> >>>> {links,[<0.106.0>]}, >> >>>> {dictionary,[]}, >> >>>> {trap_exit,false}, >> >>>> {status,running}, >> >>>> {heap_size,233}, >> >>>> {stack_size,24}, >> >>>> {reductions,200}], >> >>>> []]}} >> >>>> >> >>>> (+lots of other messages within the next couple of minutes) >> >>>> >> >>>> The same error occured once more, about four hours later. >> >>>> >> >>>> I'm quite new to CouchDB, so I'd appreciate any help in interpreting >> >>>> what these error messages mean. (BTW, are these something I should >> >>>> report as bugs in JIRA? I can do that, but I'd like to at least >> understand >> >>>> which parts of the error messages are actually relevant here :-) >> >>>> >> >>>> I'm running CouchDB 1.0.2 with Erlang R14B on 64-bit RHEL 5.6. >> >>>> >> >>>> Best regards, >> >>>> Pasi >> >>>> >> >>> >> >>> The error you're getting is because CouchDB is running out of >> >>> available file descriptors to use. Try increasing the limit for the >> >>> user running CouchDB. >> >>> >> >> >> > >> >> > --bcaec5431d2a4905ac049f4acfc8--