From: Carlos Alonso
Date: Mon, 09 Oct 2017 10:53:00 +0000
Subject: Re: Trying to understand why a node gets 'frozen'
To: user@couchdb.apache.org, Joan Touzet

I'd like to connect a diagnostic tool such as etop, observer, ... to see which processes are open there, but I cannot seem to get it working. Could anyone please share how to run any of those tools against a remote server?

Regards
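As a starting point for the question above, here is a minimal sketch of one way to attach etop (or a plain remote shell) to a running CouchDB node. It assumes the Erlang cookie from the node's vm.args (the value below is a placeholder), the node name couchdb@couchdb-node-1 seen in the logs further down, and that the debug shell uses the same naming mode (-name vs -sname) as the CouchDB node:

    # Start a hidden Erlang node with the same cookie and naming mode
    # as the CouchDB node; the cookie value is a placeholder.
    erl -name debug@127.0.0.1 -hidden -setcookie 'COOKIE_FROM_VM_ARGS'

    %% From that shell, point etop at the CouchDB node. If it complains
    %% about tracing or runtime_tools, add {tracing, off}.
    1> etop:start([{node, 'couchdb@couchdb-node-1'}, {interval, 5}, {lines, 20}]).

    %% Alternatively, open a remote shell on the node itself:
    %%   erl -name debug@127.0.0.1 -hidden -setcookie 'COOKIE_FROM_VM_ARGS' \
    %%       -remsh couchdb@couchdb-node-1
    %% and inspect it directly, e.g.:
    1> erlang:system_info(process_count).

observer:start() needs a GUI, so on a headless server it is usually easier to run observer on a workstation and connect to the node from its Nodes > Connect Node menu, provided the cookie matches and the EPMD/distribution ports are reachable.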
On Sat, Oct 7, 2017 at 6:13 PM Carlos Alonso wrote:

> So I have found another relevant symptom. After adding _system endpoint monitoring, I discovered that this particular node behaves differently from the other ones in terms of Erlang process count.
>
> The process_count metric of the normal nodes is stable at around 1k to 1.3k, while the affected node's process_count grows slowly but continuously until a little above 5k processes, which is when it gets 'frozen'. After restarting, the value comes back to the normal 1k to 1.3k (and immediately starts growing slowly again, of course :)).
>
> Any ideas? Thanks!
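Once a shell is attached to the misbehaving node (as in the sketch above), something along these lines — a generic Erlang snippet, nothing CouchDB-specific — can show which kinds of processes are piling up between the normal ~1.3k and the ~5k it freezes at:

    %% Count live processes grouped by initial call; compare the top of
    %% this list between a healthy node and the one whose process_count
    %% keeps growing.
    1> Counts = lists:foldl(
           fun(P, D) ->
                   case erlang:process_info(P, initial_call) of
                       {initial_call, C} -> dict:update_counter(C, 1, D);
                       undefined -> D    % process exited while iterating
                   end
           end, dict:new(), processes()),
       lists:reverse(lists:keysort(2, dict:to_list(Counts))).

Many OTP processes report proc_lib:init_p/5 as their initial call; for those, erlang:process_info(P, dictionary) and its '$initial_call' entry reveal the real module. Whatever dominates the list on the misbehaving node but not on the healthy ones is the leak candidate.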
> On Tue, Oct 3, 2017 at 11:18 PM Carlos Alonso wrote:
>
>> This is one of the complete error sequences I can see:
>>
>> [error] 2017-10-03T21:13:16.716692Z couchdb@couchdb-node-1 emulator -------- Error in process <0.24558.209> on node 'couchdb@couchdb-node-1' with exit value:
>> {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}
>>
>> [error] 2017-10-03T21:13:16.717606Z couchdb@couchdb-node-1 <0.5208.204> aab326c0bb req_err(2515771787) badmatch : ok [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6 L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
>>
>> [error] 2017-10-03T21:13:16.717859Z couchdb@couchdb-node-1 <0.20718.207> -------- Replicator, request PUT to "http://127.0.0.1:5984/my_db/de45a832a1fac563c89da73dc7dc4d3e?new_edits=false" failed due to error {error, {'EXIT', {{{nocatch,{mp_parser_died,noproc}}, ...
>>
>> Regards
>>
>> On Tue, Oct 3, 2017 at 11:05 PM Carlos Alonso wrote:
>>
>>> The 'weird' thing about the mp_parser_died error is that, according to the description of issue 745, the replication never finishes because the item that fails once seems to fail forever. In my case, however, the items fail but then seem to work (possibly because the replication is retried), as I can find the documents that generated the errors (in the logs) in the target db...
>>>
>>> Regards
>>>
>>> On Tue, Oct 3, 2017 at 10:52 PM Carlos Alonso wrote:
>>>
>>>> So, to give some more context: this node is responsible for replicating a database that has quite a lot of attachments, and it raises the 'famous' mp_parser_died,noproc error, which I think is this one: https://github.com/apache/couchdb/issues/745
>>>>
>>>> What I've identified so far from the logs is that, along with the error described above, this error also appears:
>>>>
>>>> [error] 2017-10-03T19:54:32.380379Z couchdb@couchdb-node-1 <0.30012.3408> 520e44b7ae req_err(2515771787) badmatch : ok [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6 L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
>>>>
>>>> Sometimes it appears just after the mp_parser_died error; sometimes the parser error happens without 'triggering' one of these badmatch ones.
>>>>
>>>> Then, after a while of this sequence, the initially described sel_conn_closed error starts being raised for all requests and the node gets frozen. It is not responsive, but it is still not removed from the cluster, holding its replications and, obviously, not replicating anything until it is restarted.
>>>>
>>>> I can also see interleaved unauthorized errors, which don't make much sense as I'm the only one accessing this cluster:
>>>>
>>>> [error] 2017-10-03T19:33:47.022572Z couchdb@couchdb-node-1 <0.32501.3323> c683120c97 rexi_server throw:{unauthorized,<<"You are not authorized to access this db.">>} [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,99}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,261}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
>>>>
>>>> To me, it feels like the mp_parser_died error slowly breaks something that in the end leaves the node unresponsive, as those errors happen a lot in that particular replication.
>>>>
>>>> Regards, and thanks a lot for your help!
>>>>
>>>> On Tue, Oct 3, 2017 at 7:59 PM Joan Touzet wrote:
>>>>
>>>>> Is there more to the error? All this shows us is that the replicator itself attempted a POST and had the connection closed on it. (Remember that the replicator is basically just a custom client that sits alongside CouchDB on the same machine.) There should be more in the error log that shows why CouchDB hung up the phone.
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Carlos Alonso"
>>>>> To: "user"
>>>>> Sent: Tuesday, 3 October, 2017 4:18:18 AM
>>>>> Subject: Re: Trying to understand why a node gets 'frozen'
>>>>>
>>>>> Hello, this is happening every day, always on the same node. Any ideas?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Sun, Oct 1, 2017 at 11:42 AM Carlos Alonso <carlos.alonso@cabify.com> wrote:
>>>>>
>>>>> > Hello everyone!!
>>>>> >
>>>>> > I'm trying to understand an issue we're experiencing on CouchDB 2.1.0 running on Ubuntu 14.04. The cluster is currently replicating from another source cluster, and we have seen that one node gets frozen from time to time, and we have to restart it to get it to respond again.
>>>>> >
>>>>> > Before becoming unresponsive, the node throws a lot of {error, sel_conn_closed}. See an example trace below.
>>>>> >
>>>>> > [error] 2017-10-01T05:25:23.921126Z couchdb@couchdb-1 <0.13489.0> -------- gen_server <0.13489.0> terminated with reason: {checkpoint_commit_failure,<<"Failure on target commit: {'EXIT',{http_request_failed,\"POST\", \"http://127.0.0.1:5984/mydb/_ensure_full_commit\", {error,sel_conn_closed}}}">>}
>>>>> > last msg: {'EXIT',<0.10626.0>,{checkpoint_commit_failure,<<"Failure on target commit: {'EXIT',{http_request_failed,\"POST\", \"http://127.0.0.1:5984/mydb/_ensure_full_commit\", {error,sel_conn_closed}}}">>}}
>>>>> > state: {state,<0.10626.0>,<0.13490.0>,20,{httpdb,"https://source_ip/mydb/",nil,[{"Accept","application/json"},{"Authorization","Basic ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{is_ssl,true},{socket_options,[{keepalive,true},{nodelay,false}]},{ssl_options,[{depth,3},{verify,verify_none}]}],10,250,<0.11931.0>,20,nil,undefined},{httpdb,"http://127.0.0.1:5984/mydb/",nil,[{"Accept","application/json"},{"Authorization","Basic ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,<0.11995.0>,20,nil,undefined},[],<0.25756.4748>,nil,{<0.13490.0>,#Ref<0.0.724041731.98305>},[{docs_read,1},{missing_checked,1},{missing_found,1}],nil,nil,{batch,[<<"{\"_id\":\"df84bfda818ea150b249da89e8d79a38\",\"_rev\":\"1-ebb0119fbdcad604ad372fa6e05d06a2\",...\":{\"start\":1,\"ids\":[\"ebb0119fbdcad604ad372fa6e05d06a2\"]}}">>],605}}
>>>>> >
>>>>> > The particular node is 'responsible' for a replication that produces quite a lot of {mp_parser_died,noproc} errors, which AFAIK is a known bug (https://github.com/apache/couchdb/issues/745), but I don't know whether that is related.
>>>>> >
>>>>> > When that happens, just restarting the node brings it back up and running properly.
>>>>> >
>>>>> > Any help would be really appreciated.
>>>>> >
>>>>> > Regards
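Since the trace above shows the replicator's POST to _ensure_full_commit on the local node failing with sel_conn_closed, a manual probe of that same endpoint while the node is misbehaving can help distinguish a wedged HTTP stack from a replicator-only problem. The database name and credentials below are placeholders:

    # Run on the affected node; mydb and admin:password are placeholders.
    curl -sv -X POST \
         -H 'Content-Type: application/json' \
         -u admin:password \
         http://127.0.0.1:5984/mydb/_ensure_full_commit

If this hangs or has its connection closed while a plain GET / on port 5984 still answers, the problem is narrower than the node's whole HTTP layer.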
--
Carlos Alonso
Data Engineer
Madrid, Spain
carlos.alonso@cabify.com