From: Carlos Alonso
Date: Mon, 09 Oct 2017 10:53:00 +0000
Subject: Re: Trying to understand why a node gets 'frozen'
To: user@couchdb.apache.org, Joan Touzet

I'd like to connect a diagnostic tool such as etop, observer, ... to see which processes are open there, but I cannot seem to get it working. Could anyone please share how to run any of those tools against a remote server?

Regards
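As a starting point for the question above, here is a minimal sketch of one way to attach etop (or a plain remote shell) to a running CouchDB node. It assumes the Erlang cookie from the node's vm.args (the value below is a placeholder), the node name couchdb@couchdb-node-1 seen in the logs further down, and that the debug shell uses the same naming mode (-name vs -sname) as the CouchDB node:

    # Start a hidden Erlang node with the same cookie and naming mode
    # as the CouchDB node; the cookie value is a placeholder.
    erl -name debug@127.0.0.1 -hidden -setcookie 'COOKIE_FROM_VM_ARGS'

    %% From that shell, point etop at the CouchDB node. If it complains
    %% about tracing or runtime_tools, add {tracing, off}.
    1> etop:start([{node, 'couchdb@couchdb-node-1'}, {interval, 5}, {lines, 20}]).

    %% Alternatively, open a remote shell on the node itself:
    %%   erl -name debug@127.0.0.1 -hidden -setcookie 'COOKIE_FROM_VM_ARGS' \
    %%       -remsh couchdb@couchdb-node-1
    %% and inspect it directly, e.g.:
    1> erlang:system_info(process_count).

observer:start() needs a GUI, so on a headless server it is usually easier to run observer on a workstation and connect to the node from its Nodes > Connect Node menu, provided the cookie matches and the EPMD/distribution ports are reachable.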
On Sat, Oct 7, 2017 at 6:13 PM Carlos Alonso wrote:

> So I have found another relevant symptom. After adding _system endpoint monitoring, I discovered that this particular node behaves differently from the other ones in terms of Erlang process count.
>
> The process_count metric of the normal nodes is stable at around 1k to 1.3k, while the affected node's process_count grows slowly but continuously until a little above 5k processes, which is when it gets 'frozen'. After restarting, the value comes back to the normal 1k to 1.3k (and immediately starts growing slowly again, of course :)).
>
> Any ideas? Thanks!
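Once a shell is attached to the misbehaving node (as in the sketch above), something along these lines — a generic Erlang snippet, nothing CouchDB-specific — can show which kinds of processes are piling up between the normal ~1.3k and the ~5k it freezes at:

    %% Count live processes grouped by initial call; compare the top of
    %% this list between a healthy node and the one whose process_count
    %% keeps growing.
    1> Counts = lists:foldl(
           fun(P, D) ->
                   case erlang:process_info(P, initial_call) of
                       {initial_call, C} -> dict:update_counter(C, 1, D);
                       undefined -> D    % process exited while iterating
                   end
           end, dict:new(), processes()),
       lists:reverse(lists:keysort(2, dict:to_list(Counts))).

Many OTP processes report proc_lib:init_p/5 as their initial call; for those, erlang:process_info(P, dictionary) and its '$initial_call' entry reveal the real module. Whatever dominates the list on the misbehaving node but not on the healthy ones is the leak candidate.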
> On Tue, Oct 3, 2017 at 11:18 PM Carlos Alonso wrote:
>
>> This is one of the complete error sequences I can see:
>>
>> [error] 2017-10-03T21:13:16.716692Z couchdb@couchdb-node-1 emulator -------- Error in process <0.24558.209> on node 'couchdb@couchdb-node-1' with exit value:
>> {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}
>>
>> [error] 2017-10-03T21:13:16.717606Z couchdb@couchdb-node-1 <0.5208.204> aab326c0bb req_err(2515771787) badmatch : ok [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6 L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
>>
>> [error] 2017-10-03T21:13:16.717859Z couchdb@couchdb-node-1 <0.20718.207> -------- Replicator, request PUT to "http://127.0.0.1:5984/my_db/de45a832a1fac563c89da73dc7dc4d3e?new_edits=false" failed due to error {error, {'EXIT', {{{nocatch,{mp_parser_died,noproc}}, ...
>>
>> Regards
>>
>> On Tue, Oct 3, 2017 at 11:05 PM Carlos Alonso wrote:
>>
>>> The 'weird' thing about the mp_parser_died error is that, according to the description of issue 745, the replication never finishes because the item that fails once seems to fail forever. In my case, however, the items fail but then seem to work (possibly because the replication is retried), as I can find the documents that generated the errors (in the logs) in the target db...
>>>
>>> Regards
>>>
>>> On Tue, Oct 3, 2017 at 10:52 PM Carlos Alonso wrote:
>>>
>>>> So, to give some more context: this node is responsible for replicating a database that has quite a lot of attachments, and it raises the 'famous' mp_parser_died,noproc error, which I think is this one: https://github.com/apache/couchdb/issues/745
>>>>
>>>> What I've identified so far from the logs is that, along with the error described above, this error also appears:
>>>>
>>>> [error] 2017-10-03T19:54:32.380379Z couchdb@couchdb-node-1 <0.30012.3408> 520e44b7ae req_err(2515771787) badmatch : ok [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6 L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
>>>>
>>>> Sometimes it appears just after the mp_parser_died error; sometimes the parser error happens without 'triggering' one of these badmatch ones.
>>>>
>>>> Then, after a while of this sequence, the initially described sel_conn_closed error starts being raised for all requests and the node gets frozen. It is not responsive, but it is still not removed from the cluster, holding its replications and, obviously, not replicating anything until it is restarted.
>>>>
>>>> I can also see interleaved unauthorized errors, which don't make much sense as I'm the only one accessing this cluster:
>>>>
>>>> [error] 2017-10-03T19:33:47.022572Z couchdb@couchdb-node-1 <0.32501.3323> c683120c97 rexi_server throw:{unauthorized,<<"You are not authorized to access this db.">>} [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,99}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,261}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
>>>>
>>>> To me, it feels like the mp_parser_died error slowly breaks something that in the end leaves the node unresponsive, as those errors happen a lot in that particular replication.
>>>>
>>>> Regards, and thanks a lot for your help!
>>>>
>>>> On Tue, Oct 3, 2017 at 7:59 PM Joan Touzet wrote:
>>>>
>>>>> Is there more to the error? All this shows us is that the replicator itself attempted a POST and had the connection closed on it. (Remember that the replicator is basically just a custom client that sits alongside CouchDB on the same machine.) There should be more in the error log that shows why CouchDB hung up the phone.
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Carlos Alonso"
>>>>> To: "user"
>>>>> Sent: Tuesday, 3 October, 2017 4:18:18 AM
>>>>> Subject: Re: Trying to understand why a node gets 'frozen'
>>>>>
>>>>> Hello, this is happening every day, always on the same node. Any ideas?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Sun, Oct 1, 2017 at 11:42 AM Carlos Alonso <carlos.alonso@cabify.com> wrote:
>>>>>
>>>>> > Hello everyone!!
>>>>> >
>>>>> > I'm trying to understand an issue we're experiencing on CouchDB 2.1.0 running on Ubuntu 14.04. The cluster is currently replicating from another source cluster, and we have seen that one node gets frozen from time to time, and we have to restart it to get it to respond again.
>>>>> >
>>>>> > Before becoming unresponsive, the node throws a lot of {error, sel_conn_closed}. See an example trace below.
>>>>> >
>>>>> > [error] 2017-10-01T05:25:23.921126Z couchdb@couchdb-1 <0.13489.0> -------- gen_server <0.13489.0> terminated with reason: {checkpoint_commit_failure,<<"Failure on target commit: {'EXIT',{http_request_failed,\"POST\", \"http://127.0.0.1:5984/mydb/_ensure_full_commit\", {error,sel_conn_closed}}}">>}
>>>>> > last msg: {'EXIT',<0.10626.0>,{checkpoint_commit_failure,<<"Failure on target commit: {'EXIT',{http_request_failed,\"POST\", \"http://127.0.0.1:5984/mydb/_ensure_full_commit\", {error,sel_conn_closed}}}">>}}
>>>>> > state: {state,<0.10626.0>,<0.13490.0>,20,{httpdb,"https://source_ip/mydb/",nil,[{"Accept","application/json"},{"Authorization","Basic ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{is_ssl,true},{socket_options,[{keepalive,true},{nodelay,false}]},{ssl_options,[{depth,3},{verify,verify_none}]}],10,250,<0.11931.0>,20,nil,undefined},{httpdb,"http://127.0.0.1:5984/mydb/",nil,[{"Accept","application/json"},{"Authorization","Basic ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,<0.11995.0>,20,nil,undefined},[],<0.25756.4748>,nil,{<0.13490.0>,#Ref<0.0.724041731.98305>},[{docs_read,1},{missing_checked,1},{missing_found,1}],nil,nil,{batch,[<<"{\"_id\":\"df84bfda818ea150b249da89e8d79a38\",\"_rev\":\"1-ebb0119fbdcad604ad372fa6e05d06a2\",...\":{\"start\":1,\"ids\":[\"ebb0119fbdcad604ad372fa6e05d06a2\"]}}">>],605}}
>>>>> >
>>>>> > The particular node is 'responsible' for a replication that produces quite a lot of {mp_parser_died,noproc} errors, which AFAIK is a known bug (https://github.com/apache/couchdb/issues/745), but I don't know whether that is related.
>>>>> >
>>>>> > When that happens, just restarting the node brings it back up and running properly.
>>>>> >
>>>>> > Any help would be really appreciated.
>>>>> >
>>>>> > Regards
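Since the trace above shows the replicator's POST to _ensure_full_commit on the local node failing with sel_conn_closed, a manual probe of that same endpoint while the node is misbehaving can help distinguish a wedged HTTP stack from a replicator-only problem. The database name and credentials below are placeholders:

    # Run on the affected node; mydb and admin:password are placeholders.
    curl -sv -X POST \
         -H 'Content-Type: application/json' \
         -u admin:password \
         http://127.0.0.1:5984/mydb/_ensure_full_commit

If this hangs or has its connection closed while a plain GET / on port 5984 still answers, the problem is narrower than the node's whole HTTP layer.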
--
Carlos Alonso
Data Engineer
Madrid, Spain
carlos.alonso@cabify.com