mesos-user mailing list archives

From Paul Bell <arach...@gmail.com>
Subject Re: Help needed (alas, urgently)
Date Fri, 15 Jan 2016 14:19:30 GMT
In chasing down this problem, I stumbled upon something of moment: the
problem does NOT seem to happen with kernel 3.13.

Some weeks back, in the hope of getting past another problem wherein the
root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04
LTS). The kernel upgrade was done as shown here (there's some extra stuff
to get rid of Ubuntu desktop and liberate some disk space):

  apt-get update
  apt-get -y remove ubuntu-desktop
  apt-get -y purge lightdm
  rm -Rf /var/lib/lightdm-data
  apt-get -y remove --purge libreoffice-core
  apt-get -y remove --purge libreoffice-common

  echo "  Installing new kernel"

  apt-get -y install linux-generic-lts-vivid
  apt-get -y autoremove linux-image-3.13.0-32-generic
  apt-get -y autoremove linux-image-3.13.0-71-generic
  update-grub
  reboot

After the reboot, a "uname -r" shows kernel 3.19.0-42-generic.
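To double-check that the upgrade took, I verify what is installed versus what
is actually running (a small sketch; assumes a dpkg-based system like this
Ubuntu box):

```shell
# List kernel images still installed after the autoremove steps above,
# then confirm which one is actually running.
command -v dpkg >/dev/null && dpkg -l 'linux-image-*' | awk '/^ii/ {print $2}'
uname -r
```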

Under this kernel I can now reliably reproduce the failure to stop a
MongoDB container. Specifically, any & all attempts to kill the container,
e.g., via

   - a Marathon HTTP DELETE (which leads to mesos-docker-executor issuing a
   "docker stop" command)
   - getting a shell inside the running container and issuing "kill" or
   db.shutdownServer()

cause the mongod container

   - to show in its log that it's shutting down normally
   - to enter a 100% CPU loop
   - to become unkillable (only reboot "fixes" things)
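When it next hangs, I plan to capture the process state, since a task that
survives even SIGKILL is usually stuck in uninterruptible sleep ('D') inside
the kernel. A sketch (inspect_proc is my own helper name, not anything from
Mesos; reading /proc/<pid>/stack needs root):

```shell
# Inspect a possibly-unkillable process by name. STAT 'D' means
# uninterruptible sleep (even SIGKILL cannot take effect until the
# kernel call returns); 'R' means it is genuinely spinning.
inspect_proc() {
    pid=$(pgrep -x "$1" | head -n 1)
    [ -n "$pid" ] || { echo "no process named $1"; return 1; }
    ps -o pid,stat,wchan:30,comm -p "$pid"
    # The kernel stack usually shows where it is wedged (needs root).
    cat "/proc/$pid/stack" 2>/dev/null
}
```

e.g. `inspect_proc mongod` while the container is stuck at 100% CPU.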

Note finally that my conclusion about kernel 3.13 "working" is at present a
weak induction. But I do know that when I reverted to that kernel I could,
at least once, stop the containers w/o any problems; whereas at 3.19 I can
reliably reproduce the problem. I will try to make this induction stronger
as the day wears on.

Did I do something "wrong" in my kernel upgrade steps?

Is anyone aware of such an issue in 3.19 or of work done post-3.13 in the
area of task termination & signal handling?
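For context on what I'd expect from a healthy kernel (a toy sketch, nothing
Mesos- or Docker-specific): SIGTERM, which "docker stop" sends first, can be
ignored by a process, but SIGKILL cannot be caught or ignored, so a container
that survives SIGKILL points at the kernel rather than the application.

```shell
# Toy demo of the two signals involved in "docker stop".
sh -c 'trap "" TERM; sleep 30' &    # child that ignores SIGTERM
pid=$!
sleep 1                             # give the trap time to install
kill -TERM "$pid"                   # the polite signal; ignored here
sleep 1
kill -0 "$pid" 2>/dev/null && echo "still alive after SIGTERM"
kill -KILL "$pid"                   # cannot be caught or ignored
wait "$pid" 2>/dev/null
kill -0 "$pid" 2>/dev/null || echo "gone after SIGKILL"
```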

Thanks for your help.

-Paul


On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell <arachweb@gmail.com> wrote:

> I spoke too soon, I'm afraid.
>
> Next time I did the stop (with zero timeout), I see the same phenomenon: a
> mongo container showing repeated:
>
> killing docker task
> shutting down
>
>
> What else can I try?
>
> Thank you.
>
> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell <arachweb@gmail.com> wrote:
>
>> Hi Tim,
>>
>> I set docker_stop_timeout to zero as you asked. I am pleased to report
>> (though a bit fearful about being pleased) that this change seems to have
>> shut everyone down pretty much instantly.
>>
>> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
>> the immediate use of "kill -9" (SIGKILL) as opposed to "kill -15" (SIGTERM)?
>>
>> I will keep testing the behavior.
>>
>> Thank you.
>>
>> -Paul
>>
>> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell <arachweb@gmail.com> wrote:
>>
>>> Hi Tim,
>>>
>>> Things have gotten slightly odder (if that's possible). When I now start
>>> the application's 5 or so containers, only one, "ecxconfigdb", gets
>>> started - and even it took a few tries. That is, I see it failing, moving
>>> to deploying, then starting again. But I've no evidence (no STDOUT, and no
>>> docker container logs) showing why.
>>>
>>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>>> application I am seeing the phenomena I posted before: "Killing docker
>>> task" / "Shutting down" repeated many times. The un-stopped container is
>>> now running at 100% CPU.
>>>
>>> I will try modifying docker_stop_timeout. Back shortly
>>>
>>> Thanks again.
>>>
>>> -Paul
>>>
>>> PS: what do you make of the "broken pipe" error in the docker.log?
>>>
>>> *from /var/log/upstart/docker.log*
>>>
>>> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
>>> INFO[3054] GET /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> ERRO[3054] Handler for GET /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json returned error: No such image: mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> ERRO[3054] HTTP Error err=No such image: mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b statusCode=404
>>> INFO[3054] GET /v1.15/containers/weave/json
>>> INFO[3054] POST /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> INFO[3054] POST /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1&stdout=1&stream=1
>>> INFO[3054] POST /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
>>> INFO[3054] GET /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3054] GET /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3054] GET /v1.15/containers/weave/json
>>> INFO[3054] GET /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3054] GET /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3054] GET /v1.15/containers/weave/json
>>> INFO[3054] GET /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3054] GET /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> INFO[3111] GET /v1.21/containers/json
>>> INFO[3120] GET /v1.21/containers/cf7/json
>>> INFO[3120] GET /v1.21/containers/cf7/logs?stderr=1&stdout=1&tail=all
>>> INFO[3153] GET /containers/json
>>> INFO[3153] GET /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3153] GET /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
>>> INFO[3153] GET /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
>>> INFO[3175] GET /containers/json
>>> INFO[3175] GET /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3175] GET /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
>>> INFO[3175] GET /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
>>> INFO[3175] POST /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/stop?t=15
>>> ERRO[3175] attach: stdout: write unix @: broken pipe
>>> INFO[3190] Container cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47 failed to exit within 15 seconds of SIGTERM - using the force
>>> INFO[3200] Container cf7fc7c48324 failed to exit within 10 seconds of kill - trying direct SIGKILL
>>>
>>> *STDOUT from Mesos:*
>>>
>>> --container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b"
>>> --docker="/usr/local/ecxmcc/weaveShim" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b"
>>> --stop_timeout="15secs"
>>> --container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b"
>>> --docker="/usr/local/ecxmcc/weaveShim" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b"
>>> --stop_timeout="15secs"
>>> Registered docker executor on 71.100.202.99
>>> Starting task ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285
>>> 2016-01-14T20:45:38.613+0000 [initandlisten] MongoDB starting : pid=1
>>> port=27017 dbpath=/data/db 64-bit host=ecxconfigdb
>>> 2016-01-14T20:45:38.614+0000 [initandlisten] db version v2.6.8
>>> 2016-01-14T20:45:38.614+0000 [initandlisten] git version:
>>> 3abc04d6d4f71de00b57378e3277def8fd7a6700
>>> 2016-01-14T20:45:38.614+0000 [initandlisten] build info: Linux
>>> build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27
>>> UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
>>> 2016-01-14T20:45:38.614+0000 [initandlisten] allocator: tcmalloc
>>> 2016-01-14T20:45:38.614+0000 [initandlisten] options: { storage: {
>>> journal: { enabled: true } } }
>>> 2016-01-14T20:45:38.616+0000 [initandlisten] journal dir=/data/db/journal
>>> 2016-01-14T20:45:38.616+0000 [initandlisten] recover : no journal files
>>> present, no recovery needed
>>> 2016-01-14T20:45:39.006+0000 [initandlisten] waiting for connections on
>>> port 27017
>>> 2016-01-14T20:46:38.975+0000 [clientcursormon] mem (MB) res:77 virt:12942
>>> 2016-01-14T20:46:38.975+0000 [clientcursormon]  mapped (incl journal
>>> view):12762
>>> 2016-01-14T20:46:38.975+0000 [clientcursormon]  connections:0
>>> Killing docker task
>>> Shutting down
>>> Killing docker task
>>> Shutting down
>>> Killing docker task
>>> Shutting down
>>>
>>> On Thu, Jan 14, 2016 at 3:38 PM, Paul Bell <arachweb@gmail.com> wrote:
>>>
>>>> Hey Tim,
>>>>
>>>> Thank you very much for your reply.
>>>>
>>>> Yes, I am in the midst of trying to reproduce the problem. If
>>>> successful (so to speak), I will do as you ask.
>>>>
>>>> Cordially,
>>>>
>>>> Paul
>>>>
>>>> On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen <tim@mesosphere.io> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> Looks like we've already issued the docker stop, as seen in the ps
>>>>> output, but the containers are still running. Can you look at the Docker
>>>>> daemon logs and see what's going on there?
>>>>>
>>>>> And can you also try to modify docker_stop_timeout to 0 so that
>>>>> we SIGKILL the containers right away, and see if this still happens?
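>>>>> If the agent was installed from the mesosphere packages (an assumption
>>>>> on my part), the flag can be set via a file under /etc/mesos-slave and a
>>>>> slave restart; otherwise just change --docker_stop_timeout=15secs to
>>>>> 0secs wherever the slave flags are configured:
>>>>>
>>>>> ```
>>>>> # mesosphere deb packaging convention: file name = flag name
>>>>> echo '0secs' > /etc/mesos-slave/docker_stop_timeout
>>>>> service mesos-slave restart
>>>>> ```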
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell <arachweb@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> It's been quite some time since I've posted here and that's chiefly
>>>>>> because up until a day or two ago, things were working really well.
>>>>>>
>>>>>> I actually may have posted about this some time back. But then the
>>>>>> problem seemed more intermittent.
>>>>>>
>>>>>> In summa, several "docker stops" don't work, i.e., the containers are
>>>>>> not stopped.
>>>>>>
>>>>>> Deployment:
>>>>>>
>>>>>> one Ubuntu 14.04 LTS VM (VMware) with kernel 3.19
>>>>>> Zookeeper
>>>>>> Mesos-master (0.23.0)
>>>>>> Mesos-slave (0.23.0)
>>>>>> Marathon (0.10.0)
>>>>>> Docker 1.9.1
>>>>>> Weave 1.1.0
>>>>>> Our application's containers, which include
>>>>>>     MongoDB (4)
>>>>>>     PostGres
>>>>>>     ECX (our product)
>>>>>>
>>>>>> The only thing that's changed at all in the config above is the
>>>>>> version of Docker. It used to be 1.6.2, but today I upgraded it hoping
>>>>>> to solve the problem.
>>>>>>
>>>>>>
>>>>>> My automater program stops the application by sending Marathon an
>>>>>> "HTTP DELETE" for each running app. Every now & then (reliably
>>>>>> reproducible today) not all containers get stopped. Most recently, 3
>>>>>> containers failed to stop.
>>>>>>
>>>>>> Here are the attendant phenomena:
>>>>>>
>>>>>> Marathon shows the 3 applications in deployment mode (presumably
>>>>>> "deployment" in the sense of "stopping")
>>>>>>
>>>>>> *ps output:*
>>>>>>
>>>>>> root@71:~# ps -ef | grep docker
>>>>>> root      3823     1  0 13:55 ?        00:00:02 /usr/bin/docker
>>>>>> daemon -H unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
>>>>>> root      4967     1  0 13:57 ?        00:00:01 /usr/sbin/mesos-slave
>>>>>> --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos
>>>>>> --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim
>>>>>> --docker_stop_timeout=15secs --executor_registration_timeout=5mins
>>>>>> --hostname=71.100.202.99 --ip=71.100.202.99
>>>>>> --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
>>>>>> root      5263  3823  0 13:57 ?        00:00:00 docker-proxy -proto
>>>>>> tcp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2
>>>>>> -container-port 6783
>>>>>> root      5271  3823  0 13:57 ?        00:00:00 docker-proxy -proto
>>>>>> udp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2
>>>>>> -container-port 6783
>>>>>> root      5279  3823  0 13:57 ?        00:00:00 docker-proxy -proto
>>>>>> tcp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2
>>>>>> -container-port 53
>>>>>> root      5287  3823  0 13:57 ?        00:00:00 docker-proxy -proto
>>>>>> udp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2
>>>>>> -container-port 53
>>>>>> root      7119  4967  0 14:00 ?        00:00:01 mesos-docker-executor
>>>>>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
>>>>>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>>>>>> --mapped_directory=/mnt/mesos/sandbox
>>>>>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2
>>>>>> --stop_timeout=15secs
>>>>>> root      7378  4967  0 14:00 ?        00:00:01 mesos-docker-executor
>>>>>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>>>>>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>>>>>> --mapped_directory=/mnt/mesos/sandbox
>>>>>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>>>>>> --stop_timeout=15secs
>>>>>> root      7640  4967  0 14:01 ?        00:00:01 mesos-docker-executor
>>>>>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
>>>>>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>>>>>> --mapped_directory=/mnt/mesos/sandbox
>>>>>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298
>>>>>> --stop_timeout=15secs
>>>>>> root      9696  9695  0 14:06 ?        00:00:00 /usr/bin/docker stop
>>>>>> -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
>>>>>> root      9709  9708  0 14:06 ?        00:00:00 /usr/bin/docker stop
>>>>>> -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>>>>>> root      9720  9719  0 14:06 ?        00:00:00 /usr/bin/docker stop
>>>>>> -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
>>>>>>
>>>>>> *docker ps output:*
>>>>>>
>>>>>> root@71:~# docker ps
>>>>>> CONTAINER ID        IMAGE                         COMMAND
>>>>>>      CREATED             STATUS              PORTS
>>>>>>                                                                 NAMES
>>>>>> 5abafbfe7de2        mongo:2.6.8                   "/w/w
>>>>>> /entrypoint.sh "   11 minutes ago      Up 11 minutes       27017/tcp
>>>>>>
>>>>>>
>>>>>>  mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
>>>>>> a8449682ca2e        mongo:2.6.8                   "/w/w
>>>>>> /entrypoint.sh "   11 minutes ago      Up 11 minutes       27017/tcp
>>>>>>
>>>>>>
>>>>>>  mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>>>>>> 3b956457374b        mongo:2.6.8                   "/w/w
>>>>>> /entrypoint.sh "   11 minutes ago      Up 11 minutes       27017/tcp
>>>>>>
>>>>>>
>>>>>>  mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
>>>>>> 4c1588bb3d4b        weaveworks/weaveexec:v1.1.0
>>>>>> "/home/weave/weavepro"   15 minutes ago      Up 15 minutes
>>>>>>
>>>>>>          weaveproxy
>>>>>> a26a0363584b        weaveworks/weave:v1.1.0       "/home/weave/weaver
>>>>>> -"   15 minutes ago      Up 15 minutes       172.17.0.1:53->53/tcp,
>>>>>> 172.17.0.1:53->53/udp, 0.0.0.0:6783->6783/tcp, 0.0.0.0:6783->6783/udp
>>>>>>   weave
>>>>>>
>>>>>> *from /var/log/syslog:*
>>>>>>
>>>>>>
>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356405  5002
>>>>>> master.cpp:2944] Asked to kill task
>>>>>> mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework
>>>>>> 20160114-103414-1674208327-5050-3293-0000
>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356459  5002
>>>>>> master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at
>>>>>> slave(1)@71.100.202.99:5051 (71.100.202.99) to
>>>>>> kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework
>>>>>> 20160114-103414-1674208327-5050-3293-0000 (marathon) at
>>>>>> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
>>>>>> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.356729  5042
>>>>>> slave.cpp:1755] Asked to kill task
>>>>>> mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework
>>>>>> 20160114-103414-1674208327-5050-3293-0000
>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.378295  5004
>>>>>> http.cpp:283] HTTP GET for /master/state.json from 172.19.15.61:65038
>>>>>> with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36
>>>>>> (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425904  5001
>>>>>> master.cpp:2944] Asked to kill task
>>>>>> ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework
>>>>>> 20160114-103414-1674208327-5050-3293-0000
>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425935  5001
>>>>>> master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at
>>>>>> slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task
>>>>>> ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework
>>>>>> 20160114-103414-1674208327-5050-3293-0000 (marathon) at
>>>>>> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
>>>>>> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.426136  5041
>>>>>> slave.cpp:1755] Asked to kill task
>>>>>> ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework
>>>>>> 20160114-103414-1674208327-5050-3293-0000
>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435932  4998
>>>>>> master.cpp:2944] Asked to kill task
>>>>>> ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework
>>>>>> 20160114-103414-1674208327-5050-3293-0000
>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435958  4998
>>>>>> master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at
>>>>>> slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task
>>>>>> ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework
>>>>>> 20160114-103414-1674208327-5050-3293-0000 (marathon) at
>>>>>> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
>>>>>> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.436151  5038
>>>>>> slave.cpp:1755] Asked to kill task
>>>>>> ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework
>>>>>> 20160114-103414-1674208327-5050-3293-0000
>>>>>> Jan 14 14:10:03 71 mesos-master[4917]: I0114 14:10:03.759009  5001
>>>>>> master.cpp:4290] Sending 1 offers to framework
>>>>>> 20160114-103414-1674208327-5050-3293-0000 (marathon) at
>>>>>> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
>>>>>> Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,765] INFO
>>>>>> started processing 1 offers, launching at most 1 tasks per offer and 1000
>>>>>> tasks in total (mesosphere.marathon.tasks.IterativeOfferMatcher$:132)
>>>>>> Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,766] INFO
>>>>>> Offer [20160114-135722-1674208327-5050-4917-O128]. Decline with default
>>>>>> filter refuseSeconds (use --decline_offer_duration to configure)
>>>>>> (mesosphere.marathon.tasks.IterativeOfferMatcher$:231)
>>>>>>
>>>>>>
>>>>>> *from Mesos STDOUT of unstopped container:*
>>>>>>
>>>>>> Starting task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9
>>>>>> 2016-01-14T19:01:10.997+0000 [initandlisten] MongoDB starting : pid=1
>>>>>> port=27019 dbpath=/data/db/config master=1 64-bit host=mongoconfig
>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] db version v2.6.8
>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] git version:
>>>>>> 3abc04d6d4f71de00b57378e3277def8fd7a6700
>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] build info: Linux
>>>>>> build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3
>>>>>> 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] allocator: tcmalloc
>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] options: { sharding: {
>>>>>> clusterRole: "configsvr" }, storage: { dbPath: "/data/db/config", journal:
>>>>>> { enabled: true } } }
>>>>>> 2016-01-14T19:01:10.999+0000 [initandlisten] journal
>>>>>> dir=/data/db/config/journal
>>>>>> 2016-01-14T19:01:11.000+0000 [initandlisten] recover : no journal
>>>>>> files present, no recovery needed
>>>>>> 2016-01-14T19:01:11.429+0000 [initandlisten] warning:
>>>>>> ClientCursor::staticYield can't unlock b/c of recursive lock ns:  top: {
>>>>>> opid: 11, active: true, secs_running: 0, microsecs_running: 36, op:
>>>>>> "query", ns: "local.oplog.$main", query: { query: {}, orderby: { $natural:
>>>>>> -1 } }, client: "0.0.0.0:0", desc: "initandlisten", threadId:
>>>>>> "0x7f8f73075b40", locks: { ^: "W" }, waitingForLock: false, numYields: 0,
>>>>>> lockStats: { timeLockedMicros: {}, timeAcquiringMicros: {} } }
>>>>>> 2016-01-14T19:01:11.429+0000 [initandlisten] waiting for connections
>>>>>> on port 27019
>>>>>> 2016-01-14T19:01:17.405+0000 [initandlisten] connection accepted from
>>>>>> 10.2.0.3:51189 #1 (1 connection now open)
>>>>>> 2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from
>>>>>> 10.2.0.3:51190 #2 (2 connections now open)
>>>>>> 2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from
>>>>>> 10.2.0.3:51191 #3 (3 connections now open)
>>>>>> 2016-01-14T19:01:17.414+0000 [conn3] first cluster operation
>>>>>> detected, adding sharding hook to enable versioning and authentication to
>>>>>> remote servers
>>>>>> 2016-01-14T19:01:17.414+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>> 2016-01-14T19:01:17.419+0000 [initandlisten] connection accepted from
>>>>>> 10.2.0.3:51193 #4 (4 connections now open)
>>>>>> 2016-01-14T19:01:17.420+0000 [initandlisten] connection accepted from
>>>>>> 10.2.0.3:51194 #5 (5 connections now open)
>>>>>> 2016-01-14T19:01:17.442+0000 [conn1] end connection 10.2.0.3:51189
>>>>>> (4 connections now open)
>>>>>> 2016-01-14T19:02:11.285+0000 [clientcursormon] mem (MB) res:59
>>>>>> virt:385
>>>>>> 2016-01-14T19:02:11.285+0000 [clientcursormon]  mapped (incl journal
>>>>>> view):192
>>>>>> 2016-01-14T19:02:11.285+0000 [clientcursormon]  connections:4
>>>>>> 2016-01-14T19:03:11.293+0000 [clientcursormon] mem (MB) res:72
>>>>>> virt:385
>>>>>> 2016-01-14T19:03:11.294+0000 [clientcursormon]  mapped (incl journal
>>>>>> view):192
>>>>>> 2016-01-14T19:03:11.294+0000 [clientcursormon]  connections:4
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>> Shutting down
>>>>>> Killing docker task
>>>>>>
>>>>>> Most disturbing in all of this is that while I can stop the
>>>>>> deployments in Marathon (which properly ends the "docker stop" commands
>>>>>> visible in ps output), I cannot bounce Docker, neither via Upstart nor
>>>>>> via a kill command. Ultimately, I have to reboot the VM.
>>>>>>
>>>>>> FWIW, the 3 mongod containers (apparently stuck in their "Killing
>>>>>> docker task"/"Shutting down" loop) are running at 100% CPU as evinced by
>>>>>> both "docker stats" and "top".
>>>>>>
>>>>>> I would truly be grateful for some guidance on this - even a mere
>>>>>> work-around would be appreciated.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> -Paul
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
