Return-Path: X-Original-To: apmail-couchdb-user-archive@www.apache.org Delivered-To: apmail-couchdb-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id B581270F8 for ; Fri, 9 Dec 2011 12:24:51 +0000 (UTC) Received: (qmail 16868 invoked by uid 500); 9 Dec 2011 12:24:49 -0000 Delivered-To: apmail-couchdb-user-archive@couchdb.apache.org Received: (qmail 16840 invoked by uid 500); 9 Dec 2011 12:24:49 -0000 Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@couchdb.apache.org Delivered-To: mailing list user@couchdb.apache.org Received: (qmail 16832 invoked by uid 99); 9 Dec 2011 12:24:49 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 09 Dec 2011 12:24:49 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of hristo@deshev.com designates 74.125.82.54 as permitted sender) Received: from [74.125.82.54] (HELO mail-ww0-f54.google.com) (74.125.82.54) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 09 Dec 2011 12:24:37 +0000 Received: by wgbdt13 with SMTP id dt13so1530230wgb.23 for ; Fri, 09 Dec 2011 04:24:17 -0800 (PST) MIME-Version: 1.0 Received: by 10.227.208.129 with SMTP id gc1mr7721922wbb.4.1323433457265; Fri, 09 Dec 2011 04:24:17 -0800 (PST) Received: by 10.216.30.203 with HTTP; Fri, 9 Dec 2011 04:24:17 -0800 (PST) X-Originating-IP: [92.247.239.141] In-Reply-To: <4EE1D088.7070908@gmail.com> References: <4EE1D088.7070908@gmail.com> Date: Fri, 9 Dec 2011 14:24:17 +0200 Message-ID: Subject: Re: CouchDB 1.1.1 mysteriously crashing under heavy load From: Hristo Deshev To: user@couchdb.apache.org Content-Type: multipart/alternative; boundary=0015174a050c1aacfb04b3a7dd50 X-Virus-Checked: Checked by ClamAV on apache.org --0015174a050c1aacfb04b3a7dd50 Content-Type: text/plain; charset=UTF-8 Hi CGS, Thanks for your help! I dug deeper in that log file and found a lot of "OS process timed out." errors generated both by the <0.86.0> process working with one database and other processes working with other databases. I think the process in question is the python script that forwards requests to couchdb-lucene. Now, right after the last timeout error, I see this: [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>, {<0.86.0>,supervisor_report, [{supervisor,{local,couch_secondary_services}}, {errorContext,shutdown}, {reason,reached_max_restart_intensity}, {offender, [{pid,<0.8520.1>}, {name,external_manager}, {mfargs,{couch_external_manager,start_link,[]}}, {restart_type,permanent}, {shutdown,1000}, {child_type,worker}]}]}} ...with the "reached_max_restart_intensity" part looking fishy to me. I couldn't find any docs on that one, but I assume that the process dies too often and CouchDB gives up restarting it. But why would that bring the entire DB server down instead of "just" breaking couchdb-lucene queries? I already did some changes that should make the Lucene timeouts happen a lot less often. Is there a setting that I can change and up the restart intensity value? I checked that the disk space and max file sizes are okay. I have both enough free space and I host larger files on that filesystem. I am pasting the log file snippet below. Best, Hristo =========== [Thu, 08 Dec 2011 19:11:17 GMT] [error] [<0.86.0>] {error_report,<0.31.0>, {<0.86.0>,supervisor_report, [{supervisor,{local,couch_secondary_services}}, {errorContext,child_terminated}, {reason,normal}, {offender, [{pid,<0.132.0>}, {name,external_manager}, {mfargs,{couch_external_manager,start_link,[]}}, {restart_type,permanent}, {shutdown,1000}, {child_type,worker}]}]}} [Thu, 08 Dec 2011 19:11:17 GMT] [error] [<0.486.0>] Uncaught error in HTTP request: {exit, {{bad_return_value, {os_process_error, "OS process timed out."}}, {gen_server,call, [<0.525.0>, {execute, {[{<<"info">>, {[{db_name, <<"database1">>}, {doc_count, 13343011}, {doc_del_count, 0}, {update_seq, 13343013}, {purge_seq,0}, {compact_running, false}, {disk_size, 6628601611}, {instance_start_time, <<"1323371411352029">>}, {disk_format_version, 5}, {committed_update_seq, 13342813}]}}, {<<"id">>,null}, {<<"uuid">>, <<"0b482d09cd97c1bada0d53cfab9f14b2">>}, {<<"method">>, 'GET'}, {<<"requested_path">>, [<<"database1">>, <<"_fti">>, <<"_design">>, <<"log-search">>, <<"everything?q=text:error&limit=1">>]}, {<<"path">>, [<<"database1">>, <<"_fti">>, <<"_design">>, <<"log-search">>, <<"everything">>]}, {<<"query">>, {[{<<"q">>, <<"text:error">>}, {<<"limit">>, <<"1">>}]}}, {<<"headers">>, {[{<<"Connection">>, <<"Keep-Alive">>}, {<<"Host">>, <<"couch2.internal:5984">>}]}}, {<<"body">>, undefined}, {<<"peer">>, <<"127.0.0.1">>}, {<<"form">>,{[]}}, {<<"cookie">>,{[]}}, {<<"userCtx">>, {[{<<"db">>, <<"database1">>}, {<<"name">>, null}, {<<"roles">>, [<<"_admin">>]}]}}, {<<"secObj">>, {[]}}]}}, infinity]}}} ... [A LOT MORE MESSAGES LIKE THE PREVIOUS ONE] ... [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>, {<0.86.0>,supervisor_report, [{supervisor,{local,couch_secondary_services}}, {errorContext,child_terminated}, {reason,normal}, {offender, [{pid,<0.8520.1>}, {name,external_manager}, {mfargs,{couch_external_manager,start_link,[]}}, {restart_type,permanent}, {shutdown,1000}, {child_type,worker}]}]}} [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>, {<0.86.0>,supervisor_report, [{supervisor,{local,couch_secondary_services}}, {errorContext,shutdown}, {reason,reached_max_restart_intensity}, {offender, [{pid,<0.8520.1>}, {name,external_manager}, {mfargs,{couch_external_manager,start_link,[]}}, {restart_type,permanent}, {shutdown,1000}, {child_type,worker}]}]}} [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.486.0>] Uncaught error in HTTP request: {exit, {{bad_return_value, {os_process_error, "OS process timed out."}}, {gen_server,call, [<0.8529.1>, {execute, {[{<<"info">>, {[{db_name, <<"database2">>}, {doc_count, 7416681}, {doc_del_count, 0}, {update_seq, 7416683}, {purge_seq,0}, {compact_running, false}, {disk_size, 4226396124}, {instance_start_time, <<"1323371428396966">>}, {disk_format_version, 5}, {committed_update_seq, 7416483}]}}, {<<"id">>,null}, {<<"uuid">>, <<"87792b1884f91df492a069ec2e3d76a8">>}, {<<"method">>, 'GET'}, {<<"requested_path">>, [<<"database2">>, <<"_fti">>, <<"_design">>, <<"log-search">>, <<"everything?q=text:error&limit=1">>]}, {<<"path">>, [<<"database2">>, <<"_fti">>, <<"_design">>, <<"log-search">>, <<"everything">>]}, {<<"query">>, {[{<<"q">>, <<"text:error">>}, {<<"limit">>, <<"1">>}]}}, {<<"headers">>, {[{<<"Connection">>, <<"Keep-Alive">>}, {<<"Host">>, <<"couch2.internal:5984">>}]}}, {<<"body">>, undefined}, {<<"peer">>, <<"127.0.0.1">>}, {<<"form">>,{[]}}, {<<"cookie">>,{[]}}, {<<"userCtx">>, {[{<<"db">>, <<"database2">>}, {<<"name">>, null}, {<<"roles">>, [<<"_admin">>]}]}}, {<<"secObj">>, {[]}}]}}, infinity]}}} [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.78.0>] {error_report,<0.31.0>, {<0.78.0>,supervisor_report, [{supervisor,{local,couch_server_sup}}, {errorContext,child_terminated}, {reason,shutdown}, {offender, [{pid,<0.86.0>}, {name,couch_secondary_services}, {mfargs, {couch_server_sup,start_secondary_services, []}}, {restart_type,permanent}, {shutdown,infinity}, {child_type,supervisor}]}]}} =========== On Fri, Dec 9, 2011 at 11:10 AM, CGS wrote: > Hi Hristo, > > The problem is before what you presented from your log. For example, could > you copy-paste information about process <0.86.0>? From the look of the > log, it seems something (I couldn't find what from the part of the log you > presented, but it seems that process <0.86.0> could provide some hints > about that something) is crashing the generic server (without stopping the > application) repeatedly and that's the reason for which you cannot access > your documents. That explains why you are able to "fix" the problem by > restarting CouchDB and also your message "OS process has timed out" (which > is the effect, and not the cause of that problem). What is crashing your > server, no idea, but the log should be able to provide the necessary > information. > > It may be also a good idea to check the space on your harddisk or the > maximum file size allowed by your OS. But firstly, I would check what > information that process <0.86.0> could provide. > > CGS > > > > > > On 12/08/2011 10:41 PM, Hristo Deshev wrote: > >> Hi everyone, >> >> I moved some data from an Amazon EC2 small instance to a large one and in >> the process upgraded from CouchDB 1.1.0 to CouchDB 1.1.1. I also went with >> Erlang R14B04 instead of R14B03 (Hurray for commando updates!) and now my >> CouchDB instance seems to sometimes die when under heavy load. By "dying" >> I >> mean that the beam process seems to stay in memory, but the HTTP server is >> gone and no requests get served. For now I "fix" this by stopping and >> restarting the process. >> >> Here are some details on my setup. The server is running a 64-bit Ubuntu >> Server (Oneiric) Amazon EC2 image on a large instance with 2 CPU cores and >> 7.5 GB RAM. I build both Erlang and CouchDB from source. I collect log >> entries and bulk insert them in batches of up to 200 documents. I also run >> couchdb-lucene on the same host and I *think* most of the crashes happen >> when couchdb-lucene is running a tough query and is hogging the CPU or the >> HDD. I have some largish db's (~50 million documents, ~25 GB in disk >> space). I plan on splitting my dbs into smaller ones. I hope that gets me >> more responsive file access and faster full text index searches. I think >> my >> lucene indexes may be getting too large for that machine's memory and it >> can't serve them too well. I frequently get "OS process has timed out" >> errors when trying to query those indexes. Anyway, that shouldn't be >> crashing the core couchdb process, right? >> >> I am pasting my idea of what the relevant portion of the couchdb log file >> is below, hoping somebody could decipher something out of it. Am I correct >> in thinking that the "** Reason for termination == ** {timeout," part >> means >> the process is crashing since writing to or reading from a file timed out? >> Any help is greatly appreciated. >> >> Best, >> Hristo >> >> =============== >> >> [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.78.0>] {error_report,<0.31.0>, >> {<0.78.0>,supervisor_report, >> [{supervisor,{local,couch_**server_sup}}, >> {errorContext,child_**terminated}, >> {reason,shutdown}, >> {offender, >> [{pid,<0.86.0>}, >> {name,couch_secondary_**services}, >> {mfargs, >> >> {couch_server_sup,start_**secondary_services, >> []}}, >> {restart_type,permanent}, >> {shutdown,infinity}, >> {child_type,supervisor}]}]}} >> [Thu, 08 Dec 2011 20:17:21 GMT] [error] [<0.407.0>] ** Generic server >> <0.407.0> terminating >> ** Last message in was delayed_commit >> ** When Server state == {db,<0.406.0>,<0.407.0>,nil,<<** >> "1323371423957954">>, >> <0.404.0>,<0.408.0>, >> {db_header,5,204982,0, >> {199491055,{204980,0}}, >> {199498140,204980}, >> {111685732,[]}, >> 0,nil,nil,1000}, >> 204982, >> {btree,<0.404.0>, >> {199513565,{205011,0}}, >> #Fun, >> #Fun, >> #Fun, >> #Fun}, >> {btree,<0.404.0>, >> {199518784,205011}, >> #Fun, >> #Fun, >> #Fun, >> #Fun}, >> {btree,<0.404.0>, >> {111685732,[]}, >> #Fun, >> #Fun, >> #Fun,**nil}, >> 205013, >> <<"database1">>, >> "/data/couchdb/data/database1.**couch", >> [],[],nil, >> {user_ctx,null,[],undefined}, >> #Ref<0.0.30.131014>,1000, >> [before_header,after_header,**on_file_open], >> false} >> ** Reason for termination == >> ** {timeout, >> {gen_server,call, >> [<0.406.0>, >> {db_updated, >> >> {db,<0.406.0>,<0.407.0>,nil,<<**"1323371423957954">>,<0.404.0>**, >> <0.408.0>, >> {db_header,5,205013,0, >> {199513565,{205011,0}}, >> {199518784,205011}, >> {111685732,[]}, >> 0,nil,nil,1000}, >> 205013, >> {btree,<0.404.0>, >> {199513565,{205011,0}}, >> #Fun, >> #Fun, >> #Fun, >> #Fun}, >> {btree,<0.404.0>, >> {199518784,205011}, >> #Fun, >> #Fun, >> #Fun, >> #Fun}, >> {btree,<0.404.0>, >> {111685732,[]}, >> #Fun, >> #Fun, >> #Fun,**nil}, >> 205013, >> <<"database1">>, >> "/data/couchdb/data/database1.**couch", >> [],[],nil, >> {user_ctx,null,[],undefined}, >> nil,1000, >> [before_header,after_header,**on_file_open], >> false}}]}} >> >> [Thu, 08 Dec 2011 20:17:21 GMT] [error] [<0.407.0>] >> {error_report,<0.31.0>, >> {<0.407.0>,crash_report, >> >> [[{initial_call,{couch_db_**updater,init,['Argument__1']}}**, >> {pid,<0.407.0>}, >> {registered_name,[]}, >> {error_info, >> {exit, >> {timeout, >> {gen_server,call, >> [<0.406.0>, >> {db_updated, >> {db,<0.406.0>,<0.407.0>,nil, >> <<"1323371423957954">>,<0.404.** >> 0>,<0.408.0>, >> {db_header,5,205013,0, >> {199513565,{205011,0}}, >> {199518784,205011}, >> {111685732,[]}, >> 0,nil,nil,1000}, >> 205013, >> {btree,<0.404.0>, >> {199513565,{205011,0}}, >> #Fun, >> #Fun, >> #Fun, >> #Fun}, >> {btree,<0.404.0>, >> {199518784,205011}, >> #Fun, >> #Fun, >> #Fun, >> #Fun}, >> {btree,<0.404.0>, >> {111685732,[]}, >> #Fun, >> #Fun, >> #Fun,**nil}, >> 205013, >> <<"database1">>, >> "/data/couchdb/data/database1.**couch", >> [],[],nil, >> {user_ctx,null,[],undefined}, >> nil,1000, >> [before_header,after_header,** >> on_file_open], >> false}}]}}, >> [{gen_server,terminate,6}, >> {proc_lib,init_p_do_apply,3}]}**}, >> {ancestors,[<0.406.0>,<0.403.**0>]}, >> {messages,[{'EXIT',<0.406.0>,**shutdown}]}, >> {links,[]}, >> {dictionary,[]}, >> {trap_exit,true}, >> {status,running}, >> {heap_size,28657}, >> {stack_size,24}, >> {reductions,4487709}], >> []]}} >> [Thu, 08 Dec 2011 20:17:22 GMT] [error] [<0.178.0>] ** Generic server >> <0.178.0> terminating >> ** Last message in was {update_docs,<0.2027.0>, >> [[{doc,<<"**55e776b94547442ab17b82bd1a0598** >> 43">>, >> {1, >> [<<102,77,172,235,192,72,84,** >> 223,58,68,105, >> 199,153,147,196,81>>]}, >> {[{<<"host">>,<<"Host1">>}, >> {<<"time">>,1323375464000}, >> {<<"text">>, >> <<"Some text">>}, >> {<<"level">>,0}, >> {<<"source">>,<<"source1">>}, >> {<<"type">>,<<"Entry1">>}]}, >> [],false,[]}], >> >> ... >> [[A BUNCH OF DOCS HERE]] >> ... >> >> >> {[{<<"host">>,<<"Host1">>}, >> {<<"time">>,1323375467000}, >> {<<"text">>, >> <<"Some text">>}, >> {<<"level">>,0}, >> {<<"source">>,<<"source1">>}, >> {<<"type">>,<<"Entry1">>}]}, >> [],false,[]}]], >> [],false,false} >> ** When Server state == {db,<0.177.0>,<0.178.0>,nil,<<** >> "1323371411352029">>, >> <0.175.0>,<0.179.0>, >> {db_header,5,13636863,0, >> {6776455960,{13636861,0}}, >> {6776479023,13636861}, >> {1039786,[]}, >> 0,nil,nil,1000}, >> 13636863, >> {btree,<0.175.0>, >> {6776455960,{13636861,0}}, >> #Fun, >> #Fun, >> #Fun, >> #Fun}, >> {btree,<0.175.0>, >> {6776479023,13636861}, >> #Fun, >> #Fun, >> #Fun, >> #Fun}, >> {btree,<0.175.0>, >> {1039786,[]}, >> #Fun, >> #Fun, >> #Fun,**nil}, >> 13636863, >> <<"database2">>, >> "/data/couchdb/data/database2.**couch", >> [],[],nil, >> {user_ctx,null,[],undefined}, >> nil,1000, >> [before_header,after_header,**on_file_open], >> false} >> ** Reason for termination == >> ** {timeout, >> {gen_server,call, >> [<0.177.0>, >> {db_updated, >> >> {db,<0.177.0>,<0.178.0>,nil,<<**"1323371411352029">>,<0.175.0>**, >> <0.179.0>, >> {db_header,5,13636863,0, >> {6776455960,{13636861,0}}, >> {6776479023,13636861}, >> {1039786,[]}, >> 0,nil,nil,1000}, >> 13636863, >> {btree,<0.175.0>, >> {6776557909,{13637061,0}}, >> #Fun, >> #Fun, >> #Fun, >> #Fun}, >> {btree,<0.175.0>, >> {6776580448,13637061}, >> #Fun, >> #Fun, >> #Fun, >> #Fun}, >> {btree,<0.175.0>, >> {1039786,[]}, >> #Fun, >> #Fun, >> #Fun,**nil}, >> 13637063, >> <<"database2">>, >> "/data/couchdb/data/database2.**couch", >> [],[],nil, >> {user_ctx,null,[],undefined}, >> #Ref<0.0.30.133811>,1000, >> [before_header,after_header,**on_file_open], >> false}}]}} >> >> [Thu, 08 Dec 2011 20:17:22 GMT] [error] [<0.178.0>] >> {error_report,<0.31.0>, >> {<0.178.0>,crash_report, >> >> [[{initial_call,{couch_db_**updater,init,['Argument__1']}}**, >> {pid,<0.178.0>}, >> {registered_name,[]}, >> {error_info, >> {exit, >> {timeout, >> {gen_server,call, >> [<0.177.0>, >> {db_updated, >> {db,<0.177.0>,<0.178.0>,nil, >> <<"1323371411352029">>,<0.175.** >> 0>,<0.179.0>, >> {db_header,5,13636863,0, >> {6776455960,{13636861,0}}, >> {6776479023,13636861}, >> {1039786,[]}, >> 0,nil,nil,1000}, >> 13636863, >> {btree,<0.175.0>, >> {6776557909,{13637061,0}}, >> #Fun, >> #Fun, >> #Fun, >> #Fun}, >> {btree,<0.175.0>, >> {6776580448,13637061}, >> #Fun, >> #Fun, >> #Fun, >> #Fun}, >> {btree,<0.175.0>, >> {1039786,[]}, >> #Fun, >> #Fun, >> #Fun,**nil}, >> 13637063, >> <<"database2">>, >> "/data/couchdb/data/database2.**couch", >> [],[],nil, >> {user_ctx,null,[],undefined}, >> #Ref<0.0.30.133811>,1000, >> [before_header,after_header,** >> on_file_open], >> false}}]}}, >> [{gen_server,terminate,6}, >> {proc_lib,init_p_do_apply,3}]}**}, >> {ancestors,[<0.177.0>,<0.174.**0>]}, >> {messages, >> [{'EXIT',<0.177.0>,shutdown},**delayed_commit]}, >> {links,[]}, >> {dictionary,[]}, >> {trap_exit,true}, >> {status,running}, >> {heap_size,121393}, >> {stack_size,24}, >> {reductions,83311172}], >> []]}} >> >> > --0015174a050c1aacfb04b3a7dd50--