Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
Received-SPF: pass (athena.apache.org: domain of eronenp@gmail.com designates
 74.125.82.54 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:sender:in-reply-to:references:date
         :x-google-sender-auth:message-id:subject:from:to:cc:content-type
         :content-transfer-encoding;
        b=teID6/clUZ6eb088CX1DOGywcNKSXY2tUTUfFRJMhhM628Wj5duEJFvNv1ZFe/bzjx
         R1/RTs8JILnfIv1gqKZ2fUlFFnqtxGEmv1raIo0gxx8dpXM5VSvR8z/qzwA/s1P9LfLF
         3eYWX7XsTe5bmqeiXes5foLSNs3re/w7M5mxc=
MIME-Version: 1.0
Sender: eronenp@gmail.com
In-Reply-To: <AANLkTi=u-KBGbt4RJ_LRHNQL3SaWWSqY_seEiJZXEjWL@mail.gmail.com>
References: <AANLkTinetwOQAre+4nnyJGuPog4N1s5NtjFt24B2Kf68@mail.gmail.com>
	<AANLkTi=u-KBGbt4RJ_LRHNQL3SaWWSqY_seEiJZXEjWL@mail.gmail.com>
Date: Sat, 26 Feb 2011 00:43:53 +0200
Message-ID: <AANLkTikMVbX5gW_MjjrOBie3+PTkV6Mk2Xue5muib97V@mail.gmail.com>
Subject: Re: CouchDB 1.0.2 errors under load
From: Pasi Eronen <pe@iki.fi>
To: Paul Davis <paul.joseph.davis@gmail.com>
Cc: user@couchdb.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Thanks! I looked at /proc/(pid-of-beam.smp)/fd, and it seems CouchDB
is indeed using a lot of file descriptors, and the number is growing.
I restarted CouchDB (and my batch job), and after 40 minutes, it was
already using 300 fds; an hour later, the figure was 600; half
an hour later, almost 700.

The default "ulimit -n" was 1024, so that adds up to couple of hours.
As a temporary fix I changed the limit to 64K, so it should work fine
for the weekend :-) (but not much longer)

Based on "netstat -an", the fds are TCP connections stuck in CLOSE_WAIT
state. Hmm. It seems CLOSE_WAIT means the client closed the connection
but CouchDB didn't call close() to release the fd? And that's presumably
a bug somewhere in CouchDB, right?

After the initial error message, the log had ~300 almost identical
error messages of the form (with different number after pread_iolist):

[Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.31510.2>]
Uncaught error in HTTP request: {exit,
                                 {normal,
                                  {gen_server,call,
                                   [<0.288.0>,
                                    {pread_iolist,812881959},
                                    infinity]}}}

Does this give any hints where the bug might be?

If not, I'll see if I can reproduce the bug somehow in more controlled
setting. (I have a hunch that this might involve the use of HTTP
expect/continue or POST requests with chunked transfer encoding,
but let's see....)

Best regards,
Pasi

On Fri, Feb 25, 2011 at 19:01, Paul Davis <paul.joseph.davis@gmail.com> wro=
te:
> On Fri, Feb 25, 2011 at 4:18 AM, Pasi Eronen <pe@iki.fi> wrote:
>> Hi,
>>
>> I had a big batch job (inserting 10M+ documents and generating views for=
 them)
>> that ran just fine for about 6 hours, but then I got this error:
>>
>> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.276.0>] ** Generic server
>> <0.276.0> terminating
>> ** Last message in was delayed_commit
>> ** When Server state =3D=3D {db,<0.275.0>,<0.276.0>,nil,<<"1298547642391=
489">>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0<0.273.0>,<0.277.=
0>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{db_header,5,7398=
28,0,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{47786130=
11,{663866,0}},
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{47786149=
54,663866},
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0nil,0,nil=
,nil,1000},
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0739828,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{btree,<0.273.0>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{47787727=
55,{663866,0}},
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun<couc=
h_db_updater.7.10053969>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun<couc=
h_db_updater.8.35220795>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun<couc=
h_btree.5.124754102>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun<couc=
h_db_updater.9.107593676>},
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{btree,<0.273.0>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{47787746=
98,663866},
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun<couc=
h_db_updater.10.30996817>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun<couc=
h_db_updater.11.96515267>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun<couc=
h_btree.5.124754102>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun<couc=
h_db_updater.12.117826253>},
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{btree,<0.273.0>,=
nil,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun<couc=
h_btree.0.83553141>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun<couc=
h_btree.1.30790806>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Fun<couc=
h_btree.2.124754102>,nil},
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0739831,<<"foo_rep=
lication_tmp">>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0"/data/foo/couchd=
b-data/foo_replication_tmp.couch",
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0[],[],nil,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{user_ctx,null,[]=
,undefined},
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0#Ref<0.0.1793.256=
453>,1000,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0[before_header,af=
ter_header,on_file_open],
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0false}
>> ** Reason for termination =3D=3D
>> ** {{badmatch,{error,emfile}},
>> =A0 =A0[{couch_file,sync,1},
>> =A0 =A0 {couch_db_updater,commit_data,2},
>> =A0 =A0 {couch_db_updater,handle_info,2},
>> =A0 =A0 {gen_server,handle_msg,5},
>> =A0 =A0 {proc_lib,init_p_do_apply,3}]}
>>
>> (+lot of other messages with the same timestamp -- can send if they're u=
seful)
>>
>> Exactly at this time, the client got HTTP 500 status code; the request
>> was a bulk get (POST /foo_replication_tmp/_all_docs?include_docs=3Dtrue)=
.
>>
>> Just before this request, the client had made a PUT (updating an existin=
g
>> document) that got 200 status code, but apparently was not successfully
>> committed to the disk (I'm using "delayed_commits=3Dtrue" - for my app,
>> this is just fine). The client had received the new _rev value, but when
>> it tried updating the same document a minute later, there was a conflict
>> (and it's not possible that somebody else updated this same document).
>>
>> About four hours later, there was a different error ("accept_failed"
>> sounds like some temporary problem with sockets?):
>>
>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.3=
1.0>,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0{<0.20693.4>,std_error,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 [{application,mochiweb},
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0"Accept failed error","{error,emfile}"]}}
>>
>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.3=
1.0>,
>> =A0 =A0{<0.20693.4>,crash_report,
>> =A0 =A0 [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument=
__1']}},
>> =A0 =A0 =A0 {pid,<0.20693.4>},
>> =A0 =A0 =A0 {registered_name,[]},
>> =A0 =A0 =A0 {error_info,
>> =A0 =A0 =A0 =A0 =A0 {exit,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 {error,accept_failed},
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 [{mochiweb_socket_server,acceptor_loop,1},
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0{proc_lib,init_p_do_apply,3}]}},
>> =A0 =A0 =A0 {ancestors,
>> =A0 =A0 =A0 =A0 =A0 [couch_httpd,couch_secondary_services,couch_server_s=
up,<0.32.0>]},
>> =A0 =A0 =A0 {messages,[]},
>> =A0 =A0 =A0 {links,[<0.106.0>]},
>> =A0 =A0 =A0 {dictionary,[]},
>> =A0 =A0 =A0 {trap_exit,false},
>> =A0 =A0 =A0 {status,running},
>> =A0 =A0 =A0 {heap_size,233},
>> =A0 =A0 =A0 {stack_size,24},
>> =A0 =A0 =A0 {reductions,200}],
>> =A0 =A0 =A0[]]}}
>>
>> (+lots of other messages within the next couple of minutes)
>>
>> The same error occured once more, about four hours later.
>>
>> I'm quite new to CouchDB, so I'd appreciate any help in interpreting
>> what these error messages mean. (BTW, are these something I should
>> report as bugs in JIRA? I can do that, but I'd like to at least understa=
nd
>> which parts of the error messages are actually relevant here :-)
>>
>> I'm running CouchDB 1.0.2 with Erlang R14B on 64-bit RHEL 5.6.
>>
>> Best regards,
>> Pasi
>>
>
> The error you're getting is because CouchDB is running out of
> available file descriptors to use. Try increasing the limit for the
> user running CouchDB.
>