couchdb-user mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: CouchDB 1.0.2 errors under load
Date Fri, 25 Feb 2011 23:11:12 GMT
It sounds like your client is doing something funky with shutting down
connections to the server. If I were to guess, I'd say you're leaking
sockets between a call to shutdown and a call to close, though to say
for certain I'd need to see what state both ends of the socket are in.
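
If it helps, the quickest way I know to compare the two ends is plain
netstat on each host, e.g. something along these lines (5984 here is
just the default CouchDB port, adjust if yours differs):

    netstat -tanp | grep 5984

run on both the server and the client machine; the state column on
each side usually shows which end stopped short of calling close.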

The errors about pread_iolist are hard to diagnose without the
surrounding error messages, but if I were to guess, you're getting an
emfile error that cascades to the file gen_server, which then notifies
the other processes trying to issue read calls.
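
If you still have the log around, a quick check is to find the first
emfile and treat everything after it as fallout; something like this,
with the path adjusted to wherever your couch.log actually lives:

    grep -n emfile /usr/local/var/log/couchdb/couch.log | head -n 5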

On Fri, Feb 25, 2011 at 5:43 PM, Pasi Eronen <pe@iki.fi> wrote:
> Thanks! I looked at /proc/(pid-of-beam.smp)/fd, and it seems CouchDB
> is indeed using a lot of file descriptors, and the number is growing.
> I restarted CouchDB (and my batch job), and after 40 minutes, it was
> already using 300 fds; an hour later, the figure was 600; half
> an hour later, almost 700.
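>
> (For what it's worth, I'm just counting them with something like
>
>     ls /proc/<pid-of-beam.smp>/fd | wc -l
>
> where the pid placeholder is whatever "pgrep -f beam.smp" reports.)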
>
> The default "ulimit -n" was 1024, so that adds up to a couple of hours.
> As a temporary fix I changed the limit to 64K, so it should work fine
> for the weekend :-) (but not much longer)
>
> Based on "netstat -an", the fds are TCP connections stuck in CLOSE_WAIT
> state. Hmm. It seems CLOSE_WAIT means the client closed the connection
> but CouchDB didn't call close() to release the fd? And that's presumably
> a bug somewhere in CouchDB, right?
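>
> (Again for reference, the count I'm looking at is roughly
>
>     netstat -an | grep 5984 | grep -c CLOSE_WAIT
>
> with 5984 being the default CouchDB port here.)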
>
> After the initial error message, the log had ~300 almost identical
> error messages of the following form (with a different number after pread_iolist):
>
> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.31510.2>]
> Uncaught error in HTTP request: {exit,
>                                 {normal,
>                                  {gen_server,call,
>                                   [<0.288.0>,
>                                    {pread_iolist,812881959},
>                                    infinity]}}}
>
> Does this give any hints where the bug might be?
>
> If not, I'll see if I can reproduce the bug somehow in a more controlled
> setting. (I have a hunch that this might involve the use of HTTP
> expect/continue or POST requests with chunked transfer encoding,
> but let's see....)
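>
> (If it comes to that, my rough plan is to hit a scratch database with
> curl and watch the fd count, something like
>
>     curl -v -X POST http://localhost:5984/scratch_db \
>          -H 'Content-Type: application/json' \
>          -H 'Expect: 100-continue' \
>          -d '{"foo":"bar"}'
>
> plus a chunked-transfer variant; scratch_db and the payload are just
> placeholders.)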
>
> Best regards,
> Pasi
>
> On Fri, Feb 25, 2011 at 19:01, Paul Davis <paul.joseph.davis@gmail.com> wrote:
>> On Fri, Feb 25, 2011 at 4:18 AM, Pasi Eronen <pe@iki.fi> wrote:
>>> Hi,
>>>
>>> I had a big batch job (inserting 10M+ documents and generating views for them)
>>> that ran just fine for about 6 hours, but then I got this error:
>>>
>>> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.276.0>] ** Generic server
>>> <0.276.0> terminating
>>> ** Last message in was delayed_commit
>>> ** When Server state == {db,<0.275.0>,<0.276.0>,nil,<<"1298547642391489">>,
>>>                            <0.273.0>,<0.277.0>,
>>>                            {db_header,5,739828,0,
>>>                                {4778613011,{663866,0}},
>>>                                {4778614954,663866},
>>>                                nil,0,nil,nil,1000},
>>>                            739828,
>>>                            {btree,<0.273.0>,
>>>                                {4778772755,{663866,0}},
>>>                                #Fun<couch_db_updater.7.10053969>,
>>>                                #Fun<couch_db_updater.8.35220795>,
>>>                                #Fun<couch_btree.5.124754102>,
>>>                                #Fun<couch_db_updater.9.107593676>},
>>>                            {btree,<0.273.0>,
>>>                                {4778774698,663866},
>>>                                #Fun<couch_db_updater.10.30996817>,
>>>                                #Fun<couch_db_updater.11.96515267>,
>>>                                #Fun<couch_btree.5.124754102>,
>>>                                #Fun<couch_db_updater.12.117826253>},
>>>                            {btree,<0.273.0>,nil,
>>>                                #Fun<couch_btree.0.83553141>,
>>>                                #Fun<couch_btree.1.30790806>,
>>>                                #Fun<couch_btree.2.124754102>,nil},
>>>                            739831,<<"foo_replication_tmp">>,
>>>                            "/data/foo/couchdb-data/foo_replication_tmp.couch",
>>>                            [],[],nil,
>>>                            {user_ctx,null,[],undefined},
>>>                            #Ref<0.0.1793.256453>,1000,
>>>                            [before_header,after_header,on_file_open],
>>>                            false}
>>> ** Reason for termination ==
>>> ** {{badmatch,{error,emfile}},
>>>    [{couch_file,sync,1},
>>>     {couch_db_updater,commit_data,2},
>>>     {couch_db_updater,handle_info,2},
>>>     {gen_server,handle_msg,5},
>>>     {proc_lib,init_p_do_apply,3}]}
>>>
>>> (+lots of other messages with the same timestamp -- I can send them if they're useful)
>>>
>>> Exactly at this time, the client got an HTTP 500 status code; the request
>>> was a bulk get (POST /foo_replication_tmp/_all_docs?include_docs=true).
>>>
>>> Just before this request, the client had made a PUT (updating an existing
>>> document) that got a 200 status code, but apparently was not successfully
>>> committed to disk (I'm using "delayed_commits=true" - for my app,
>>> this is just fine). The client had received the new _rev value, but when
>>> it tried updating the same document a minute later, there was a conflict
>>> (and it's not possible that somebody else updated this same document).
>>>
>>> About four hours later, there was a different error ("accept_failed"
>>> sounds like some temporary problem with sockets?):
>>>
>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.31.0>,
>>>              {<0.20693.4>,std_error,
>>>               [{application,mochiweb},
>>>                "Accept failed error","{error,emfile}"]}}
>>>
>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.31.0>,
>>>    {<0.20693.4>,crash_report,
>>>     [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
>>>       {pid,<0.20693.4>},
>>>       {registered_name,[]},
>>>       {error_info,
>>>           {exit,
>>>               {error,accept_failed},
>>>               [{mochiweb_socket_server,acceptor_loop,1},
>>>                {proc_lib,init_p_do_apply,3}]}},
>>>       {ancestors,
>>>           [couch_httpd,couch_secondary_services,couch_server_sup,<0.32.0>]},
>>>       {messages,[]},
>>>       {links,[<0.106.0>]},
>>>       {dictionary,[]},
>>>       {trap_exit,false},
>>>       {status,running},
>>>       {heap_size,233},
>>>       {stack_size,24},
>>>       {reductions,200}],
>>>      []]}}
>>>
>>> (+lots of other messages within the next couple of minutes)
>>>
>>> The same error occurred once more, about four hours later.
>>>
>>> I'm quite new to CouchDB, so I'd appreciate any help in interpreting
>>> what these error messages mean. (BTW, are these something I should
>>> report as bugs in JIRA? I can do that, but I'd like to at least understand
>>> which parts of the error messages are actually relevant here :-)
>>>
>>> I'm running CouchDB 1.0.2 with Erlang R14B on 64-bit RHEL 5.6.
>>>
>>> Best regards,
>>> Pasi
>>>
>>
>> The error you're getting means that CouchDB is running out of
>> available file descriptors. Try increasing the limit for the
>> user running CouchDB.
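>>
>> If you want the higher limit to survive restarts, the usual place is
>> /etc/security/limits.conf, e.g. something like this (assuming CouchDB
>> runs as a dedicated "couchdb" user):
>>
>>     couchdb  soft  nofile  65536
>>     couchdb  hard  nofile  65536
>>
>> or simply a "ulimit -n 65536" in the init script before starting the
>> server, since limits.conf only applies to sessions that go through PAM.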
>>
>
