Subject: Re: Tasks with failed health-checks intermittently not restarted
From: haosdent
To: user@mesos.apache.org
Date: Sun, 11 Oct 2015 18:18:51 +0800

I could not reproduce your problem on my side, but I guess it may be related to this ticket: MESOS-1613 (HealthCheckTest.ConsecutiveFailures is flaky).

On Fri, Oct 9, 2015 at 12:13 PM, haosdent wrote:

> I think it may be because the health check exits before the executor
> receives the TaskHealthStatus. I will try "exit 1" and give you my
> feedback later.
>
> On Fri, Oct 9, 2015 at 11:30 AM, Jay Taylor wrote:
>
>> Following up on this:
>>
>> This problem is reproducible when the command is "exit 1".
>>
>> Once I set it to a real curl command, the intermittent failures stopped
>> and health checks worked as advertised.
>>
>> On Oct 8, 2015, at 12:45 PM, Jay Taylor wrote:
>>
>> Using the following health-check parameters:
>>
>> cmd="exit 1"
>> delay=5.0
>> grace-period=10.0
>> interval=10.0
>> timeout=10.0
>> consecutiveFailures=3
>>
>> Sometimes the tasks are successfully identified as failing and restarted;
>> at other times the health-check command exits, yet the task is left in a
>> running state and the failure is ignored.
>>
>> Sample log from a failed Mesos task:
>>
>> STDOUT:
>>
>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --stop_timeout="0ns"
>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --stop_timeout="0ns"
>>> Registered docker executor on mesos-worker2a
>>> Starting task hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>> Launching health check process: /usr/libexec/mesos/mesos-health-check
>>> --executor=(1)@192.168.225.59:38776
>>> --health_check_json={"command":{"shell":true,"value":"docker exec
>>> mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662
>>> sh -c \" exit 1 \""},"consecutive_failures":3,"delay_seconds":5.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":10.0}
>>> --task_id=hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>>
>>> Health check process launched at pid: 7525
>>> Received task health update, healthy: false
>>> Received task health update, healthy: false
>>
>> STDERR:
>>
>>> I1008 19:30:02.569856  7408 exec.cpp:134] Version: 0.26.0
>>> I1008 19:30:02.571815  7411 exec.cpp:208] Executor registered on slave
>>> 61373c0e-7349-4173-ab8d-9d7b260e8a30-S1
>>> WARNING: Your kernel does not support swap limit capabilities, memory
>>> limited without swap.
>>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>>> I1008 19:30:08.527354  7533 main.cpp:100] Ignoring failure as health
>>> check still in grace period
>>> W1008 19:30:38.912325  7525 main.cpp:375] Health check failed Health
>>> command check exited with status 1
>>
>> Screenshot of the task still running even though the health check exited
>> with status code 1:
>>
>> http://i.imgur.com/zx9GQuo.png
>>
>> The expected behavior when the health-check binary has exited with a
>> non-zero status is that the task is killed and restarted (rather than
>> continuing to run as outlined above).
>>
>> -----
>> Additional note: after hard-coding the "path" string of the health-check
>> binary's parent dir into b/src/docker/executor.cpp, I am able to at least
>> test the functionality. The other issue, of health checks for Docker
>> tasks failing to start, is still unresolved due to the unpropagated
>> MESOS_LAUNCH_DIR issue.
>
> --
> Best Regards,
> Haosdent Huang

--
Best Regards,
Haosdent Huang
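[Editor's sketch] The timing behavior debated in this thread (the failure at 19:30:08 is ignored because of the grace period, and the kill should fire only after three consecutive post-grace failures) can be simulated with a short Python sketch. This is a hypothetical model of the checker loop using Jay's parameters as defaults, not the actual mesos-health-check source; the function name and return convention are invented for illustration:

```python
def checks_until_kill(check, *, delay=5.0, grace_period=10.0,
                      interval=10.0, consecutive_failures=3,
                      max_checks=100):
    """Simulate the health-check loop against a fake clock (no sleeping).

    `check` is a zero-argument callable returning True when healthy.
    Returns the number of checks run before the kill threshold is hit,
    or None if it is never hit within `max_checks` checks.
    """
    elapsed = delay        # the first check fires after the initial delay
    failures = 0
    for i in range(1, max_checks + 1):
        if check():
            failures = 0   # any healthy result resets the streak
        elif elapsed <= grace_period:
            # Mirrors the log line "Ignoring failure as health check
            # still in grace period".
            pass
        else:
            failures += 1
            if failures >= consecutive_failures:
                return i   # at this point the executor should kill the task
        elapsed += interval
    return None

# A check that always fails, like `exit 1` in the thread:
# the grace-period failure is ignored, then 3 consecutive failures
# trip the threshold on the 4th check.
print(checks_until_kill(lambda: False))  # → 4
```

Under this model the task should always be reported unhealthy after a bounded number of checks, which is why the intermittently ignored failures in Jay's logs point at a delivery race (the executor missing the TaskHealthStatus) rather than at the counting logic itself.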
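[Editor's sketch] Jay's workaround was to replace `exit 1` with a real curl command. As the `health_check_json` in the log shows, a single probe ultimately boils down to running a shell command with a timeout and treating exit status 0 as healthy. A minimal, hypothetical Python sketch of that step (`run_check` is an invented name, not a Mesos API):

```python
import subprocess

def run_check(command, timeout=10.0):
    """Run a shell health-check command and report healthiness.

    Mirrors the semantics visible in the log's health_check_json:
    shell=true, a per-check timeout, and exit status 0 meaning healthy.
    A timed-out check counts as unhealthy.
    """
    try:
        proc = subprocess.run(command, shell=True, timeout=timeout,
                              capture_output=True)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0

# `exit 1` reproduces the always-failing probe from the thread:
print(run_check("exit 1"))  # → False
```

A probe like `exit 1` fails instantly on every run, whereas a real curl command takes nonzero wall-clock time, which is consistent with haosdent's hypothesis that the instant-exit case races the executor's receipt of the TaskHealthStatus update.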