mesos-user mailing list archives

From Tim Chen <...@mesosphere.io>
Subject Re: Help needed (alas, urgently)
Date Thu, 14 Jan 2016 20:19:15 GMT
Hi Paul,

Looks like we've already issued the docker stop, as you can see in the ps
output, but the containers are still running. Can you look at the Docker
daemon logs and see what's going on there?
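
On Ubuntu 14.04 the Docker daemon runs under Upstart, so (assuming the
default setup) its log should be at /var/log/upstart/docker.log:

    tail -n 200 /var/log/upstart/docker.log

Dumping the daemon's view of one of the stuck containers may also help,
e.g.:

    docker inspect -f '{{json .State}}' \
      mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298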

Can you also try setting docker_stop_timeout to 0 so that we SIGKILL the
containers right away, and see if this still happens?
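
That's just a change to the slave command line from your ps output, e.g.:

    /usr/sbin/mesos-slave --master=zk://71.100.202.99:2181/mesos \
      --log_dir=/var/log/mesos --containerizers=docker,mesos \
      --docker=/usr/local/ecxmcc/weaveShim --docker_stop_timeout=0secs \
      --executor_registration_timeout=5mins --hostname=71.100.202.99 \
      --ip=71.100.202.99 --attributes=hostType:ecx,shard1 \
      --resources=ports:[31000-31999,8443-8443]

With 0secs the executor runs "docker stop -t 0", which skips the grace
period and kills the container right away.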

Tim



On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell <arachweb@gmail.com> wrote:

> Hi All,
>
> It's been quite some time since I've posted here and that's chiefly
> because up until a day or two ago, things were working really well.
>
> I actually may have posted about this some time back. But then the problem
> seemed more intermittent.
>
> In sum, several "docker stops" don't work, i.e., the containers are not
> stopped.
>
> Deployment:
>
> one Ubuntu 14.04 LTS VM (VMware) with kernel 3.19
> Zookeeper
> Mesos-master (0.23.0)
> Mesos-slave (0.23.0)
> Marathon (0.10.0)
> Docker 1.9.1
> Weave 1.1.0
> Our application containers, which include:
>     MongoDB (4)
>     Postgres
>     ECX (our product)
>
> The only thing that's changed at all in the config above is the version of
> Docker: it used to be 1.6.2, but today I upgraded it, hoping to solve the
> problem.
>
>
> My automater program stops the application by sending Marathon an "http
> delete" for each running app. Every now & then (reliably reproducible
> today) not all containers get stopped. Most recently, 3 containers failed
> to stop.
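>
> Concretely, the delete my program sends is roughly the following (app id
> illustrative; Marathon's default port 8080 assumed):
>
>     curl -X DELETE http://71.100.202.99:8080/v2/apps/mongoconfig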
>
> Here are the attendant phenomena:
>
> Marathon shows the 3 applications in deployment mode (presumably
> "deployment" in the sense of "stopping")
>
> *ps output:*
>
> root@71:~# ps -ef | grep docker
> root      3823     1  0 13:55 ?        00:00:02 /usr/bin/docker daemon -H
> unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
> root      4967     1  0 13:57 ?        00:00:01 /usr/sbin/mesos-slave
> --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos
> --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim
> --docker_stop_timeout=15secs --executor_registration_timeout=5mins
> --hostname=71.100.202.99 --ip=71.100.202.99
> --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
> root      5263  3823  0 13:57 ?        00:00:00 docker-proxy -proto tcp
> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
> 6783
> root      5271  3823  0 13:57 ?        00:00:00 docker-proxy -proto udp
> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
> 6783
> root      5279  3823  0 13:57 ?        00:00:00 docker-proxy -proto tcp
> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
> 53
> root      5287  3823  0 13:57 ?        00:00:00 docker-proxy -proto udp
> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
> 53
> root      7119  4967  0 14:00 ?        00:00:01 mesos-docker-executor
> --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2
> --stop_timeout=15secs
> root      7378  4967  0 14:00 ?        00:00:01 mesos-docker-executor
> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89
> --stop_timeout=15secs
> root      7640  4967  0 14:01 ?        00:00:01 mesos-docker-executor
> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298
> --stop_timeout=15secs
> *root      9696  9695  0 14:06 ?        00:00:00 /usr/bin/docker stop -t 15
> mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
> *root      9709  9708  0 14:06 ?        00:00:00 /usr/bin/docker stop -t 15
> mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89*
> *root      9720  9719  0 14:06 ?        00:00:00 /usr/bin/docker stop -t 15
> mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2*
>
> *docker ps output:*
>
> root@71:~# docker ps
> CONTAINER ID        IMAGE                         COMMAND
>  CREATED             STATUS              PORTS
>                                                              NAMES
> 5abafbfe7de2        mongo:2.6.8                   "/w/w /entrypoint.sh "
> 11 minutes ago      Up 11 minutes       27017/tcp
>
>  mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
> a8449682ca2e        mongo:2.6.8                   "/w/w /entrypoint.sh "
> 11 minutes ago      Up 11 minutes       27017/tcp
>
>  mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
> 3b956457374b        mongo:2.6.8                   "/w/w /entrypoint.sh "
> 11 minutes ago      Up 11 minutes       27017/tcp
>
>  mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
> 4c1588bb3d4b        weaveworks/weaveexec:v1.1.0   "/home/weave/weavepro"
> 15 minutes ago      Up 15 minutes
>                                                            weaveproxy
> a26a0363584b        weaveworks/weave:v1.1.0       "/home/weave/weaver -"
> 15 minutes ago      Up 15 minutes       172.17.0.1:53->53/tcp,
> 172.17.0.1:53->53/udp, 0.0.0.0:6783->6783/tcp, 0.0.0.0:6783->6783/udp
> weave
>
> *from /var/log/syslog:*
>
>
> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356405  5002
> master.cpp:2944] Asked to kill task
> mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework
> 20160114-103414-1674208327-5050-3293-0000
> *Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356459  5002
> master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at
> slave(1)@71.100.202.99:5051 (71.100.202.99) to
> kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework
> 20160114-103414-1674208327-5050-3293-0000 (marathon) at
> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167*
> *Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.356729  5042
> slave.cpp:1755] Asked to kill task
> mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework
> 20160114-103414-1674208327-5050-3293-0000*
> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.378295  5004
> http.cpp:283] HTTP GET for /master/state.json from 172.19.15.61:65038
> with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36
> (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425904  5001
> master.cpp:2944] Asked to kill task
> ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework
> 20160114-103414-1674208327-5050-3293-0000
> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425935  5001
> master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at
> slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task
> ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework
> 20160114-103414-1674208327-5050-3293-0000 (marathon) at
> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.426136  5041
> slave.cpp:1755] Asked to kill task
> ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework
> 20160114-103414-1674208327-5050-3293-0000
> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435932  4998
> master.cpp:2944] Asked to kill task
> ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework
> 20160114-103414-1674208327-5050-3293-0000
> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435958  4998
> master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at
> slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task
> ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework
> 20160114-103414-1674208327-5050-3293-0000 (marathon) at
> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.436151  5038
> slave.cpp:1755] Asked to kill task
> ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework
> 20160114-103414-1674208327-5050-3293-0000
> Jan 14 14:10:03 71 mesos-master[4917]: I0114 14:10:03.759009  5001
> master.cpp:4290] Sending 1 offers to framework
> 20160114-103414-1674208327-5050-3293-0000 (marathon) at
> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
> Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,765] INFO started
> processing 1 offers, launching at most 1 tasks per offer and 1000 tasks in
> total (mesosphere.marathon.tasks.IterativeOfferMatcher$:132)
> Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,766] INFO Offer
> [20160114-135722-1674208327-5050-4917-O128]. Decline with default filter
> refuseSeconds (use --decline_offer_duration to configure)
> (mesosphere.marathon.tasks.IterativeOfferMatcher$:231)
>
>
> *from Mesos STDOUT of unstopped container:*
>
> Starting task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9
> 2016-01-14T19:01:10.997+0000 [initandlisten] MongoDB starting : pid=1
> port=27019 dbpath=/data/db/config master=1 64-bit host=mongoconfig
> 2016-01-14T19:01:10.998+0000 [initandlisten] db version v2.6.8
> 2016-01-14T19:01:10.998+0000 [initandlisten] git version:
> 3abc04d6d4f71de00b57378e3277def8fd7a6700
> 2016-01-14T19:01:10.998+0000 [initandlisten] build info: Linux
> build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27
> UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
> 2016-01-14T19:01:10.998+0000 [initandlisten] allocator: tcmalloc
> 2016-01-14T19:01:10.998+0000 [initandlisten] options: { sharding: {
> clusterRole: "configsvr" }, storage: { dbPath: "/data/db/config", journal:
> { enabled: true } } }
> 2016-01-14T19:01:10.999+0000 [initandlisten] journal
> dir=/data/db/config/journal
> 2016-01-14T19:01:11.000+0000 [initandlisten] recover : no journal files
> present, no recovery needed
> 2016-01-14T19:01:11.429+0000 [initandlisten] warning:
> ClientCursor::staticYield can't unlock b/c of recursive lock ns:  top: {
> opid: 11, active: true, secs_running: 0, microsecs_running: 36, op:
> "query", ns: "local.oplog.$main", query: { query: {}, orderby: { $natural:
> -1 } }, client: "0.0.0.0:0", desc: "initandlisten", threadId:
> "0x7f8f73075b40", locks: { ^: "W" }, waitingForLock: false, numYields: 0,
> lockStats: { timeLockedMicros: {}, timeAcquiringMicros: {} } }
> 2016-01-14T19:01:11.429+0000 [initandlisten] waiting for connections on
> port 27019
> 2016-01-14T19:01:17.405+0000 [initandlisten] connection accepted from
> 10.2.0.3:51189 #1 (1 connection now open)
> 2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from
> 10.2.0.3:51190 #2 (2 connections now open)
> 2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from
> 10.2.0.3:51191 #3 (3 connections now open)
> 2016-01-14T19:01:17.414+0000 [conn3] first cluster operation detected,
> adding sharding hook to enable versioning and authentication to remote
> servers
> 2016-01-14T19:01:17.414+0000 [conn3] CMD fsync: sync:1 lock:0
> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0
> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0
> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0
> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0
> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0
> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0
> 2016-01-14T19:01:17.419+0000 [initandlisten] connection accepted from
> 10.2.0.3:51193 #4 (4 connections now open)
> 2016-01-14T19:01:17.420+0000 [initandlisten] connection accepted from
> 10.2.0.3:51194 #5 (5 connections now open)
> 2016-01-14T19:01:17.442+0000 [conn1] end connection 10.2.0.3:51189 (4
> connections now open)
> 2016-01-14T19:02:11.285+0000 [clientcursormon] mem (MB) res:59 virt:385
> 2016-01-14T19:02:11.285+0000 [clientcursormon]  mapped (incl journal
> view):192
> 2016-01-14T19:02:11.285+0000 [clientcursormon]  connections:4
> 2016-01-14T19:03:11.293+0000 [clientcursormon] mem (MB) res:72 virt:385
> 2016-01-14T19:03:11.294+0000 [clientcursormon]  mapped (incl journal
> view):192
> 2016-01-14T19:03:11.294+0000 [clientcursormon]  connections:4
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> Killing docker task
>
> Most disturbing in all of this is that while I can stop the deployments in
> Marathon (which properly ends the "docker stop" commands visible in the ps
> output), I cannot bounce Docker, neither via Upstart nor via the kill
> command. Ultimately, I have to reboot the VM.
>
> FWIW, the 3 mongod containers (apparently stuck in their "Killing docker
> task" / "Shutting down" loop) are running at 100% CPU, as evinced by both
> "docker stats" and "top".
>
> I would truly be grateful for some guidance on this - even a mere
> work-around would be appreciated.
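>
> For instance, would it be safe to just SIGKILL the container processes
> directly? Something like this, for one of the stuck containers (assuming
> "docker inspect" still responds even though "docker stop" hangs):
>
>     kill -9 $(docker inspect -f '{{.State.Pid}}' \
>       mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298)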
>
> Thank you.
>
> -Paul
>
