Date: Thu, 14 Jan 2016 17:14:52 -0500
Subject: Re: Help needed (alas, urgently)
From: Paul Bell
To: user@mesos.apache.org

I spoke too soon, I'm afraid.

The next time I did the stop (with zero timeout), I saw the same phenomenon: a mongo container showing repeated:

killing docker task
shutting down

What else can I try?

Thank you.

On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell wrote:

> Hi Tim,
>
> I set docker_stop_timeout to zero as you asked. I am pleased to report
> (though a bit fearful about being pleased) that this change seems to have
> shut everyone down pretty much instantly.
>
> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
> the immediate use of "kill -9" as opposed to "kill -2"?
>
> I will keep testing the behavior.
>
> Thank you.
>
> -Paul
>
> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell wrote:
>
>> Hi Tim,
>>
>> Things have gotten slightly odder (if that's possible). When I now start
>> the application (5 or so containers), only one, "ecxconfigdb", gets started -
>> and even it took a few tries. That is, I see it failing, moving to
>> deploying, then starting again. But I've no evidence (no STDOUT, and no
>> docker ctr logs) that shows why.
>>
>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>> application, I am seeing the phenomena I posted before: "killing docker task,
>> shutting down" repeated many times. The un-stopped container is now running
>> at 100% CPU.
>>
>> I will try modifying docker_stop_timeout. Back shortly.
>>
>> Thanks again.
>>
>> -Paul
>>
>> PS: what do you make of the "broken pipe" error in the docker.log?
>> >> *from /var/log/upstart/docker.log* >> >> [34mINFO [0m[3054] GET /v1.15/images/mongo:2.6.8/json >> [34mINFO [0m[3054] GET >> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json >> [31mERRO [0m[3054] Handler for GET >> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json >> returned error: No such image: >> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b >> [31mERRO [0m[3054] HTTP Error [31merr >> [0m=No such image: >> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b >> [31mstatusCode [0m=404 >> [34mINFO [0m[3054] GET /v1.15/containers/weave/json >> [34mINFO [0m[3054] POST >> /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b >> [34mINFO [0m[3054] POST >> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1&stdout=1&stream=1 >> [34mINFO [0m[3054] POST >> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start >> [34mINFO [0m[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> [34mINFO [0m[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> [34mINFO [0m[3054] GET /v1.15/containers/weave/json >> [34mINFO [0m[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> [34mINFO [0m[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> [34mINFO [0m[3054] GET /v1.15/containers/weave/json >> [34mINFO [0m[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> [34mINFO [0m[3054] GET >> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json >> [34mINFO [0m[3111] GET /v1.21/containers/json >> [34mINFO [0m[3120] GET /v1.21/containers/cf7/json >> [34mINFO [0m[3120] GET >> /v1.21/containers/cf7/logs?stderr=1&stdout=1&tail=all >> [34mINFO [0m[3153] GET /containers/json >> [34mINFO [0m[3153] GET >> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> [34mINFO [0m[3153] GET >> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json >> [34mINFO [0m[3153] GET >> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json >> [34mINFO [0m[3175] GET /containers/json >> [34mINFO [0m[3175] GET >> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> [34mINFO [0m[3175] GET >> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json >> [34mINFO [0m[3175] GET >> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json >> * [34mINFO [0m[3175] POST >> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/stop* >> ?t=15 >> * [31mERRO [0m[3175] attach: stdout: write unix @: broken pipe * >> * [34mINFO [0m[3190] Container >> cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47 failed to >> exit within 15 seconds of SIGTERM - using the force * >> * [34mINFO [0m[3200] Container cf7fc7c48324 failed to exit within 10 >> seconds of kill - trying direct SIGKILL * >> >> *STDOUT from Mesos:* >> >> *--container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b" >> 
*--docker="/usr/local/ecxmcc/weaveShim" --help="false" >> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" >> --mapped_directory="/mnt/mesos/sandbox" --quiet="false" >> --sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b" >> --stop_timeout="15secs" >> --container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b" >> --docker="/usr/local/ecxmcc/weaveShim" --help="false" >> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" >> --mapped_directory="/mnt/mesos/sandbox" --quiet="false" >> --sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b" >> --stop_timeout="15secs" >> Registered docker executor on 71.100.202.99 >> Starting task ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285 >> 2016-01-14T20:45:38.613+0000 [initandlisten] MongoDB starting : pid=1 >> port=27017 dbpath=/data/db 64-bit host=ecxconfigdb >> 2016-01-14T20:45:38.614+0000 [initandlisten] db version v2.6.8 >> 2016-01-14T20:45:38.614+0000 [initandlisten] git version: >> 3abc04d6d4f71de00b57378e3277def8fd7a6700 >> 2016-01-14T20:45:38.614+0000 [initandlisten] build info: Linux >> build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 >> UTC 2014 x86_64 BOOST_LIB_VERSION=1_49 >> 2016-01-14T20:45:38.614+0000 [initandlisten] allocator: tcmalloc >> 2016-01-14T20:45:38.614+0000 [initandlisten] options: { storage: { >> journal: { enabled: true } } } >> 2016-01-14T20:45:38.616+0000 [initandlisten] journal dir=/data/db/journal >> 2016-01-14T20:45:38.616+0000 [initandlisten] recover : no journal files >> present, no recovery needed >> 2016-01-14T20:45:39.006+0000 [initandlisten] waiting for connections on >> port 27017 >> 2016-01-14T20:46:38.975+0000 [clientcursormon] mem (MB) res:77 virt:12942 >> 2016-01-14T20:46:38.975+0000 [clientcursormon] mapped (incl journal >> view):12762 >> 2016-01-14T20:46:38.975+0000 [clientcursormon] connections:0 >> Killing docker task >> Shutting down >> Killing docker task >> Shutting down >> Killing docker task >> Shutting down >> >> On Thu, Jan 14, 2016 at 3:38 PM, Paul Bell wrote: >> >>> Hey Tim, >>> >>> Thank you very much for your reply. >>> >>> Yes, I am in the midst of trying to reproduce the problem. If successful >>> (so to speak), I will do as you ask. >>> >>> Cordially, >>> >>> Paul >>> >>> On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen wrote: >>> >>>> Hi Paul, >>>> >>>> Looks like we've already issued the docker stop as you seen in the ps >>>> output, but the containers are still running. Can you look at the Docker >>>> daemon logs and see what's going on there? >>>> >>>> And also can you also try to modify docker_stop_timeout to 0 so that we >>>> SIGKILL the containers right away, and see if this still happens? >>>> >>>> Tim >>>> >>>> >>>> >>>> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell wrote: >>>> >>>>> Hi All, >>>>> >>>>> It's been quite some time since I've posted here and that's chiefly >>>>> because up until a day or two ago, things were working really well. >>>>> >>>>> I actually may have posted about this some time back. But then the >>>>> problem seemed more intermittent. 
>>>>> >>>>> In summa, several "docker stops" don't work, i.e., the containers are >>>>> not stopped. >>>>> >>>>> Deployment: >>>>> >>>>> one Ubuntu 14.04 LTS VM (VMware) with kernel 3.19 >>>>> Zookeeper >>>>> Mesos-master (0.23.0) >>>>> Mesos-slave (0.23.0) >>>>> Marathon (0.10.0) >>>>> Docker 1.9.1 >>>>> Weave 1.1.0 >>>>> Our application's containers include >>>>> MongoDB (4) >>>>> Postgres >>>>> ECX (our product) >>>>> >>>>> The only thing that's changed at all in the config above is the >>>>> version of Docker. It used to be 1.6.2, but today I upgraded it hoping to solve >>>>> the problem. >>>>> >>>>> >>>>> My automater program stops the application by sending Marathon an >>>>> "http delete" for each running app. Every now & then (reliably reproducible >>>>> today) not all containers get stopped. Most recently, 3 containers failed >>>>> to stop. >>>>> >>>>> Here are the attendant phenomena: >>>>> >>>>> Marathon shows the 3 applications in deployment mode (presumably >>>>> "deployment" in the sense of "stopping"). >>>>> >>>>> *ps output:* >>>>> >>>>> root@71:~# ps -ef | grep docker >>>>> root 3823 1 0 13:55 ? 00:00:02 /usr/bin/docker daemon >>>>> -H unix:///var/run/docker.sock -H tcp://0.0.0.0:4243 >>>>> root 4967 1 0 13:57 ? 00:00:01 /usr/sbin/mesos-slave >>>>> --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos >>>>> --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim >>>>> --docker_stop_timeout=15secs --executor_registration_timeout=5mins >>>>> --hostname=71.100.202.99 --ip=71.100.202.99 >>>>> --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443] >>>>> root 5263 3823 0 13:57 ? 00:00:00 docker-proxy -proto >>>>> tcp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 >>>>> -container-port 6783 >>>>> root 5271 3823 0 13:57 ? 00:00:00 docker-proxy -proto >>>>> udp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 >>>>> -container-port 6783 >>>>> root 5279 3823 0 13:57 ? 00:00:00 docker-proxy -proto >>>>> tcp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 >>>>> -container-port 53 >>>>> root 5287 3823 0 13:57 ? 00:00:00 docker-proxy -proto >>>>> udp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 >>>>> -container-port 53 >>>>> root 7119 4967 0 14:00 ? 00:00:01 mesos-docker-executor >>>>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2 >>>>> --docker=/usr/local/ecxmcc/weaveShim --help=false >>>>> --mapped_directory=/mnt/mesos/sandbox >>>>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2 >>>>> --stop_timeout=15secs >>>>> root 7378 4967 0 14:00 ? 00:00:01 mesos-docker-executor >>>>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89 >>>>> --docker=/usr/local/ecxmcc/weaveShim --help=false >>>>> --mapped_directory=/mnt/mesos/sandbox >>>>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89 >>>>> --stop_timeout=15secs >>>>> root 7640 4967 0 14:01 ?
00:00:01 mesos-docker-executor >>>>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298 >>>>> --docker=/usr/local/ecxmcc/weaveShim --help=false >>>>> --mapped_directory=/mnt/mesos/sandbox >>>>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298 >>>>> --stop_timeout=15secs >>>>> *root 9696 9695 0 14:06 ? 00:00:00 /usr/bin/docker stop >>>>> -t 15 >>>>> mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298* >>>>> *root 9709 9708 0 14:06 ? 00:00:00 /usr/bin/docker stop >>>>> -t 15 >>>>> mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89* >>>>> *root 9720 9719 0 14:06 ? 00:00:00 /usr/bin/docker stop >>>>> -t 15 >>>>> mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2* >>>>> >>>>> *docker ps output:* >>>>> >>>>> root@71:~# docker ps >>>>> CONTAINER ID IMAGE COMMAND >>>>> CREATED STATUS PORTS >>>>> NAMES >>>>> 5abafbfe7de2 mongo:2.6.8 "/w/w /entrypoint.sh >>>>> " 11 minutes ago Up 11 minutes 27017/tcp >>>>> >>>>> mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298 >>>>> a8449682ca2e mongo:2.6.8 "/w/w /entrypoint.sh >>>>> " 11 minutes ago Up 11 minutes 27017/tcp >>>>> >>>>> mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89 >>>>> 3b956457374b mongo:2.6.8 "/w/w /entrypoint.sh >>>>> " 11 minutes ago Up 11 minutes 27017/tcp >>>>> >>>>> mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2 >>>>> 4c1588bb3d4b weaveworks/weaveexec:v1.1.0 >>>>> "/home/weave/weavepro" 15 minutes ago Up 15 minutes >>>>> >>>>> weaveproxy >>>>> a26a0363584b weaveworks/weave:v1.1.0 "/home/weave/weaver >>>>> -" 15 minutes ago Up 15 minutes 172.17.0.1:53->53/tcp, >>>>> 172.17.0.1:53->53/udp, 0.0.0.0:6783->6783/tcp, 0.0.0.0:6783->6783/udp >>>>> weave >>>>> >>>>> *from /var/log/syslog:* >>>>> >>>>> >>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356405 5002 >>>>> master.cpp:2944] Asked to kill task >>>>> mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>> 20160114-103414-1674208327-5050-3293-0000 >>>>> *Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356459 5002 >>>>> master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at >>>>> slave(1)@71.100.202.99:5051 (71.100.202.99) to >>>>> kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>> 20160114-103414-1674208327-5050-3293-0000 (marathon) at >>>>> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167 >>>>> * >>>>> *Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.356729 5042 >>>>> slave.cpp:1755] Asked to kill task >>>>> mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>> 20160114-103414-1674208327-5050-3293-0000* >>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.378295 5004 >>>>> http.cpp:283] HTTP GET for /master/state.json from 172.19.15.61:65038 >>>>> with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 >>>>> (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36' >>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425904 5001 >>>>> master.cpp:2944] Asked to kill task >>>>> ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>> 20160114-103414-1674208327-5050-3293-0000 >>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 
14:10:02.425935 5001 >>>>> master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at >>>>> slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task >>>>> ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>> 20160114-103414-1674208327-5050-3293-0000 (marathon) at >>>>> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167 >>>>> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.426136 5041 >>>>> slave.cpp:1755] Asked to kill task >>>>> ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>> 20160114-103414-1674208327-5050-3293-0000 >>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435932 4998 >>>>> master.cpp:2944] Asked to kill task >>>>> ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>> 20160114-103414-1674208327-5050-3293-0000 >>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435958 4998 >>>>> master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at >>>>> slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task >>>>> ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>> 20160114-103414-1674208327-5050-3293-0000 (marathon) at >>>>> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167 >>>>> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.436151 5038 >>>>> slave.cpp:1755] Asked to kill task >>>>> ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>> 20160114-103414-1674208327-5050-3293-0000 >>>>> Jan 14 14:10:03 71 mesos-master[4917]: I0114 14:10:03.759009 5001 >>>>> master.cpp:4290] Sending 1 offers to framework >>>>> 20160114-103414-1674208327-5050-3293-0000 (marathon) at >>>>> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167 >>>>> Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,765] INFO >>>>> started processing 1 offers, launching at most 1 tasks per offer and 1000 >>>>> tasks in total (mesosphere.marathon.tasks.IterativeOfferMatcher$:132) >>>>> Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,766] INFO >>>>> Offer [20160114-135722-1674208327-5050-4917-O128]. 
Decline with default >>>>> filter refuseSeconds (use --decline_offer_duration to configure) >>>>> (mesosphere.marathon.tasks.IterativeOfferMatcher$:231) >>>>> >>>>> >>>>> *from Mesos STDOUT of unstopped container:* >>>>> >>>>> Starting task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 >>>>> 2016-01-14T19:01:10.997+0000 [initandlisten] MongoDB starting : pid=1 >>>>> port=27019 dbpath=/data/db/config master=1 64-bit host=mongoconfig >>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] db version v2.6.8 >>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] git version: >>>>> 3abc04d6d4f71de00b57378e3277def8fd7a6700 >>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] build info: Linux >>>>> build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 >>>>> 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49 >>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] allocator: tcmalloc >>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] options: { sharding: { >>>>> clusterRole: "configsvr" }, storage: { dbPath: "/data/db/config", journal: >>>>> { enabled: true } } } >>>>> 2016-01-14T19:01:10.999+0000 [initandlisten] journal >>>>> dir=/data/db/config/journal >>>>> 2016-01-14T19:01:11.000+0000 [initandlisten] recover : no journal >>>>> files present, no recovery needed >>>>> 2016-01-14T19:01:11.429+0000 [initandlisten] warning: >>>>> ClientCursor::staticYield can't unlock b/c of recursive lock ns: top: { >>>>> opid: 11, active: true, secs_running: 0, microsecs_running: 36, op: >>>>> "query", ns: "local.oplog.$main", query: { query: {}, orderby: { $natural: >>>>> -1 } }, client: "0.0.0.0:0", desc: "initandlisten", threadId: >>>>> "0x7f8f73075b40", locks: { ^: "W" }, waitingForLock: false, numYields: 0, >>>>> lockStats: { timeLockedMicros: {}, timeAcquiringMicros: {} } } >>>>> 2016-01-14T19:01:11.429+0000 [initandlisten] waiting for connections >>>>> on port 27019 >>>>> 2016-01-14T19:01:17.405+0000 [initandlisten] connection accepted from >>>>> 10.2.0.3:51189 #1 (1 connection now open) >>>>> 2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from >>>>> 10.2.0.3:51190 #2 (2 connections now open) >>>>> 2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from >>>>> 10.2.0.3:51191 #3 (3 connections now open) >>>>> 2016-01-14T19:01:17.414+0000 [conn3] first cluster operation detected, >>>>> adding sharding hook to enable versioning and authentication to remote >>>>> servers >>>>> 2016-01-14T19:01:17.414+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>> 2016-01-14T19:01:17.419+0000 [initandlisten] connection accepted from >>>>> 10.2.0.3:51193 #4 (4 connections now open) >>>>> 2016-01-14T19:01:17.420+0000 [initandlisten] connection accepted from >>>>> 10.2.0.3:51194 #5 (5 connections now open) >>>>> 2016-01-14T19:01:17.442+0000 [conn1] end connection 10.2.0.3:51189 (4 >>>>> connections now open) >>>>> 2016-01-14T19:02:11.285+0000 [clientcursormon] mem (MB) res:59 virt:385 >>>>> 2016-01-14T19:02:11.285+0000 [clientcursormon] mapped (incl journal >>>>> view):192 >>>>> 2016-01-14T19:02:11.285+0000 [clientcursormon] connections:4 >>>>> 2016-01-14T19:03:11.293+0000 [clientcursormon] mem (MB) res:72 
virt:385 >>>>> 2016-01-14T19:03:11.294+0000 [clientcursormon] mapped (incl journal >>>>> view):192 >>>>> 2016-01-14T19:03:11.294+0000 [clientcursormon] connections:4 >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> Shutting down >>>>> Killing docker task >>>>> >>>>> Most disturbing in all of this is that while I can stop the >>>>> deployments in Marathon (which properly ends the "docker stop" commands >>>>> visible in the ps output), I cannot bounce Docker, neither via Upstart nor via a >>>>> kill command. Ultimately, I have to reboot the VM. >>>>> >>>>> FWIW, the 3 mongod containers (apparently stuck in their "Killing >>>>> docker task" / "Shutting down" loop) are running at 100% CPU, as evinced by both >>>>> "docker stats" and "top". >>>>> >>>>> I would truly be grateful for some guidance on this - even a mere >>>>> work-around would be appreciated. >>>>> >>>>> Thank you. >>>>> >>>>> -Paul
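
For reference, a minimal sketch of the stop path discussed in this thread. It is only an illustration: the Marathon host and app ID are placeholders (the thread never shows Marathon's HTTP port; 8080 is merely its usual default), and the container name stands in for the mesos-...-S0.<uuid> names visible in the docker ps output above.

# 1. The "automater" asks Marathon to destroy an app; Marathon then asks the
#    Mesos master/agent to kill each of the app's tasks (placeholder host/app ID).
curl -X DELETE http://71.100.202.99:8080/v2/apps/ecxconfigdb

# 2. For a Docker task, the agent's executor runs "docker stop" with the grace
#    period taken from --docker_stop_timeout (15secs above), exactly as the
#    ps output shows:
docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.<container-uuid>

# 3. "docker stop" sends SIGTERM first and escalates to SIGKILL only after the
#    timeout expires; that is the escalation the docker.log records ("failed to
#    exit within 15 seconds of SIGTERM - using the force"). With
#    --docker_stop_timeout=0secs the grace period is skipped, so the container
#    is killed immediately (effectively kill -9 rather than kill -15):
docker stop -t 0 mesos-20160114-135722-1674208327-5050-4917-S0.<container-uuid>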