Subject: Re: Tasks with failed health-checks intermittently not restarted
From: haosdent
To: user@mesos.apache.org
Date: Sun, 11 Oct 2015 18:18:51 +0800

I could not reproduce your problem on my side, but I guess it may be related to this ticket: MESOS-1613 (HealthCheckTest.ConsecutiveFailures is flaky).

On Fri, Oct 9, 2015 at 12:13 PM, haosdent wrote:

> I think it may be because the health check exits before the executor
> receives the TaskHealthStatus. I will try "exit 1" and give you my
> feedback later.
>
> On Fri, Oct 9, 2015 at 11:30 AM, Jay Taylor wrote:
>
>> Following up on this:
>>
>> This problem is reproducible when the command is "exit 1".
>>
>> Once I set it to a real curl command, the intermittent failures stopped
>> and health checks worked as advertised.
>>
>> On Oct 8, 2015, at 12:45 PM, Jay Taylor wrote:
>>
>> Using the following health-check parameters:
>>
>> cmd="exit 1"
>> delay=5.0
>> grace-period=10.0
>> interval=10.0
>> timeout=10.0
>> consecutiveFailures=3
>>
>> Sometimes the tasks are successfully identified as failing and restarted;
>> at other times the health-check command exits, yet the task is left in a
>> running state and the failure is ignored.
>>
>> Sample log from a failed Mesos task:
>>
>> STDOUT:
>>
>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --stop_timeout="0ns"
>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --stop_timeout="0ns"
>>> Registered docker executor on mesos-worker2a
>>> Starting task hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>> Launching health check process: /usr/libexec/mesos/mesos-health-check
>>> --executor=(1)@192.168.225.59:38776
>>> --health_check_json={"command":{"shell":true,"value":"docker exec
>>> mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662
>>> sh -c \" exit 1 \""},"consecutive_failures":3,"delay_seconds":5.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":10.0}
>>> --task_id=hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>>
>>> Health check process launched at pid: 7525
>>> Received task health update, healthy: false
>>> Received task health update, healthy: false
>>
>> STDERR:
>>
>>> I1008 19:30:02.569856  7408 exec.cpp:134] Version: 0.26.0
>>> I1008 19:30:02.571815  7411 exec.cpp:208] Executor registered on slave
>>> 61373c0e-7349-4173-ab8d-9d7b260e8a30-S1
>>> WARNING: Your kernel does not support swap limit capabilities, memory
>>> limited without swap.
>>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>>> I1008 19:30:08.527354  7533 main.cpp:100] Ignoring failure as health
>>> check still in grace period
>>> W1008 19:30:38.912325  7525 main.cpp:375] Health check failed Health
>>> command check exited with status 1
>>
>> Screenshot of the task still running even though the health check exited
>> with status code 1:
>>
>> http://i.imgur.com/zx9GQuo.png
>>
>> The expected behavior when the health-check binary has exited with a
>> non-zero status is that the task is killed and restarted (rather than
>> continuing to run as outlined above).
>>
>> -----
>> Additional note: after hard-coding the "path" string of the health-check
>> binary's parent dir into b/src/docker/executor.cpp, I am able to at least
>> test the functionality. The other issue, of health checks for Docker
>> tasks failing to start, is still unresolved due to the unpropagated
>> MESOS_LAUNCH_DIR issue.
>
> --
> Best Regards,
> Haosdent Huang

--
Best Regards,
Haosdent Huang
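[Editor's sketch] The timing behavior debated in this thread (the failure at 19:30:08 is ignored because of the grace period, and the kill should fire only after three consecutive post-grace failures) can be simulated with a short Python sketch. This is a hypothetical model of the checker loop using Jay's parameters as defaults, not the actual mesos-health-check source; the function name and return convention are invented for illustration:

```python
def checks_until_kill(check, *, delay=5.0, grace_period=10.0,
                      interval=10.0, consecutive_failures=3,
                      max_checks=100):
    """Simulate the health-check loop against a fake clock (no sleeping).

    `check` is a zero-argument callable returning True when healthy.
    Returns the number of checks run before the kill threshold is hit,
    or None if it is never hit within `max_checks` checks.
    """
    elapsed = delay        # the first check fires after the initial delay
    failures = 0
    for i in range(1, max_checks + 1):
        if check():
            failures = 0   # any healthy result resets the streak
        elif elapsed <= grace_period:
            # Mirrors the log line "Ignoring failure as health check
            # still in grace period".
            pass
        else:
            failures += 1
            if failures >= consecutive_failures:
                return i   # at this point the executor should kill the task
        elapsed += interval
    return None

# A check that always fails, like `exit 1` in the thread:
# the grace-period failure is ignored, then 3 consecutive failures
# trip the threshold on the 4th check.
print(checks_until_kill(lambda: False))  # → 4
```

Under this model the task should always be reported unhealthy after a bounded number of checks, which is why the intermittently ignored failures in Jay's logs point at a delivery race (the executor missing the TaskHealthStatus) rather than at the counting logic itself.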
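[Editor's sketch] Jay's workaround was to replace `exit 1` with a real curl command. As the `health_check_json` in the log shows, a single probe ultimately boils down to running a shell command with a timeout and treating exit status 0 as healthy. A minimal, hypothetical Python sketch of that step (`run_check` is an invented name, not a Mesos API):

```python
import subprocess

def run_check(command, timeout=10.0):
    """Run a shell health-check command and report healthiness.

    Mirrors the semantics visible in the log's health_check_json:
    shell=true, a per-check timeout, and exit status 0 meaning healthy.
    A timed-out check counts as unhealthy.
    """
    try:
        proc = subprocess.run(command, shell=True, timeout=timeout,
                              capture_output=True)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0

# `exit 1` reproduces the always-failing probe from the thread:
print(run_check("exit 1"))  # → False
```

A probe like `exit 1` fails instantly on every run, whereas a real curl command takes nonzero wall-clock time, which is consistent with haosdent's hypothesis that the instant-exit case races the executor's receipt of the TaskHealthStatus update.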