mesos-user mailing list archives

From haosdent <haosd...@gmail.com>
Subject Re: Tasks with failed health-checks intermittently not restarted
Date Sun, 11 Oct 2015 10:18:51 GMT
Could not reproduce your problem on my side, but I guess it may be related
to this ticket: MESOS-1613
<https://issues.apache.org/jira/browse/MESOS-1613>
"HealthCheckTest.ConsecutiveFailures is flaky"

On Fri, Oct 9, 2015 at 12:13 PM, haosdent <haosdent@gmail.com> wrote:

> I think it may be because the health check exits before the executor
> receives the TaskHealthStatus. I will try "exit 1" and give you feedback later.
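>
> For instance, one way to test that hypothesis would be to delay the exit
> slightly so the executor has time to receive the status update (just a
> sketch, the delay value is arbitrary):
>
>     cmd="sleep 2 && exit 1"
>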
>
> On Fri, Oct 9, 2015 at 11:30 AM, Jay Taylor <outtatime@gmail.com> wrote:
>
>> Following up on this:
>>
>> This problem is reproducible when the command is "exit 1".
>>
>> Once I set it to a real curl command, the intermittent failures stopped and
>> the health checks worked as advertised.
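>>
>> The curl command was along these lines (the endpoint and port here are
>> placeholders; --fail makes curl exit non-zero on HTTP errors, so the
>> health check registers a failure):
>>
>>     cmd="curl --fail --silent http://localhost:8080/health"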
>>
>>
>> On Oct 8, 2015, at 12:45 PM, Jay Taylor <outtatime@gmail.com> wrote:
>>
>> Using the following health-check parameters:
>>
>> cmd="exit 1"
>> delay=5.0
>> grace-period=10.0
>> interval=10.0
>> timeout=10.0
>> consecutiveFailures=3
>>
>> Sometimes the tasks are successfully identified as failing and restarted,
>> however other times the health-check command exits yet the task is left in
>> a running state and the failure is ignored.
>>
>> Sample of failed Mesos task log:
>>
>> STDOUT:
>>
>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --stop_timeout="0ns"
>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --stop_timeout="0ns"
>>> Registered docker executor on mesos-worker2a
>>> Starting task hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>> Launching health check process: /usr/libexec/mesos/mesos-health-check
>>> --executor=(1)@192.168.225.59:38776
>>> --health_check_json={"command":{"shell":true,"value":"docker exec
>>> mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662
>>> sh -c \" exit 1
>>> \""},"consecutive_failures":3,"delay_seconds":5.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":10.0}
>>> --task_id=hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>>
>>> *Health check process launched at pid: 7525*
>>> *Received task health update, healthy: false*
>>> *Received task health update, healthy: false*
>>
>>
>>
>> STDERR:
>>
>> I1008 19:30:02.569856  7408 exec.cpp:134] Version: 0.26.0
>>> I1008 19:30:02.571815  7411 exec.cpp:208] Executor registered on slave
>>> 61373c0e-7349-4173-ab8d-9d7b260e8a30-S1
>>> WARNING: Your kernel does not support swap limit capabilities, memory
>>> limited without swap.
>>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>>> I1008 19:30:08.527354  7533 main.cpp:100] Ignoring failure as health
>>> check still in grace period
>>> *W1008 19:30:38.912325  7525 main.cpp:375] Health check failed Health
>>> command check exited with status 1*
>>
>>
>> Screenshot of the task still running despite the health check exiting with
>> status code 1:
>>
>> http://i.imgur.com/zx9GQuo.png
>>
>> The expected behavior when the health-check binary exits with a non-zero
>> status is that the task is killed and restarted (rather than continuing to
>> run as shown above).
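>>
>> To double-check the task state without the UI, the master's /state.json
>> endpoint can be queried, for example (master host/port and the jq filter
>> are just an illustration):
>>
>>     curl -s http://mesos-master:5050/state.json | \
>>       jq '.frameworks[].tasks[] | {name, state}'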
>>
>> -----
>> Additional note: After hard-coding the "path" string of the health-check
>> binary's parent dir into b/src/docker/executor.cpp, I am able to at least
>> test the functionality. The other issue, health checks for Docker tasks
>> failing to start, is still unresolved due to the unpropagated
>> MESOS_LAUNCH_DIR issue.
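>>
>> (One quick way to confirm the propagation problem, with the executor pid
>> as a placeholder, is to inspect the executor's environment directly:
>>
>>     tr '\0' '\n' < /proc/<executor-pid>/environ | grep MESOS_LAUNCH_DIR
>>
>> which comes back empty when MESOS_LAUNCH_DIR is not passed through.)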
>>
>>
>
>
> --
> Best Regards,
> Haosdent Huang
>



-- 
Best Regards,
Haosdent Huang
