mesos-issues mailing list archives

From "Vinod Kone (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-5380) Killing a queued task can cause the corresponding command executor to never terminate.
Date Wed, 25 May 2016 06:30:12 GMT

    [ https://issues.apache.org/jira/browse/MESOS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299539#comment-15299539 ]

Vinod Kone commented on MESOS-5380:
-----------------------------------

Backported the fix to 0.28.x.

> Killing a queued task can cause the corresponding command executor to never terminate.
> --------------------------------------------------------------------------------------
>
>                 Key: MESOS-5380
>                 URL: https://issues.apache.org/jira/browse/MESOS-5380
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.28.0, 0.28.1
>            Reporter: Jie Yu
>            Assignee: Vinod Kone
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 0.29.0, 0.28.2
>
>
> We observed this in our testing environment. Sequence of events:
> 1) A command task is queued since the executor has not registered yet.
> 2) The framework issues a killTask.
> 3) Since the executor is in the REGISTERING state, the agent calls `statusUpdate(TASK_KILLED, UPID())`.
> 4) `statusUpdate` now calls `containerizer->status()` before calling `executor->terminateTask(status.task_id(),
status);`, which removes the queued task. (Introduced in this patch: https://reviews.apache.org/r/43258).
> 5) Since the above is asynchronous, the task may still be in `queuedTasks` when `killTask`
checks whether the unregistered executor needs to be killed:
> {code}
>       // TODO(jieyu): Here, we kill the executor if it no longer has
>       // any task to run and has not yet registered. This is a
>       // workaround for those single task executors that do not have a
>       // proper self terminating logic when they haven't received the
>       // task within a timeout.
>       if (executor->queuedTasks.empty()) {
>         CHECK(executor->launchedTasks.empty())
>             << " Unregistered executor '" << executor->id
>             << "' has launched tasks";
>         LOG(WARNING) << "Killing the unregistered executor " << *executor
>                      << " because it has no tasks";
>         executor->state = Executor::TERMINATING;
>         containerizer->destroy(executor->containerId);
>       }    
> {code}
> 6) Consequently, the executor will never be terminated by Mesos.
> Attaching the relevant agent log:
> {noformat}
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.640527
 1342 slave.cpp:1361] Got assigned task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 for
framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.641034
 1342 slave.cpp:1480] Launching task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 for framework
a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.641440
 1342 paths.cpp:528] Trying to chown '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a'
to user 'root'
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.644664
 1342 slave.cpp:5389] Launching executor mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of
framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000 with resources cpus(*):0.1; mem(*):32
in work directory '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a'
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.645195
 1342 slave.cpp:1698] Queuing task 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' for executor
'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.645491
 1338 containerizer.cpp:671] Starting container '24762d43-2134-475e-b724-caa72110497a' for
executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework 'a3ad8418-cb77-4705-b353-4b514ceca52c-0000'
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.647897
 1345 cpushare.cpp:389] Updated 'cpu.shares' to 1126 (cpus 1.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.648619
 1345 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 110ms
(cpus 1.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.650180
 1341 mem.cpp:602] Started listening for OOM events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.650718
 1341 mem.cpp:722] Started listening on low memory pressure events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.651147
 1341 mem.cpp:722] Started listening on medium memory pressure events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.651599
 1341 mem.cpp:722] Started listening on critical memory pressure events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.652015
 1341 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 160MB for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.652719
 1341 mem.cpp:388] Updated 'memory.limit_in_bytes' to 160MB for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.508930
 1342 slave.cpp:1891] Asked to kill task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of
framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.509063
 1342 slave.cpp:3048] Handling status update TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba)
for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
from @0.0.0.0:0
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.509702
 1340 disk.cpp:169] Updating the disk resources for container 24762d43-2134-475e-b724-caa72110497a
to cpus(*):0.1; mem(*):32
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.510298
 1343 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.510349
 1341 cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.511102
 1343 mem.cpp:388] Updated 'memory.limit_in_bytes' to 32MB for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.511495
 1341 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms
(cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.511715
 1341 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba)
for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.512032
 1341 status_update_manager.cpp:824] Checkpointing UPDATE for status update TASK_KILLED (UUID:
f9d15955-6c9a-4a73-98c3-97c0128510ba) for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6
of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.513849
 1343 slave.cpp:3446] Forwarding the update TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba)
for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
to master@10.0.5.79:5050
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.528929
 1344 status_update_manager.cpp:392] Received status update acknowledgement (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba)
for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.529002
 1344 status_update_manager.cpp:824] Checkpointing ACK for status update TASK_KILLED (UUID:
f9d15955-6c9a-4a73-98c3-97c0128510ba) for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6
of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.199105
 1345 isolator.cpp:469] Mounting docker volume mount point '//var/lib/rexray/volumes/jdef-test-125/data'
to '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/data'
for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.207062
 1338 containerizer.cpp:1184] Checkpointing executor's forked pid 5810 to '/var/lib/mesos/slave/meta/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/pids/forked.pid'
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.832330
 1338 slave.cpp:2689] Got registration for executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6'
of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000 from executor(1)@10.0.2.74:46154
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.833149
 1345 disk.cpp:169] Updating the disk resources for container 24762d43-2134-475e-b724-caa72110497a
to cpus(*):0.1; mem(*):32
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.833804
 1342 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.833871
 1340 cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.835160
 1340 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms
(cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: 5804 'mesos-logrotate-logger
--help=false --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stdout
--logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB '
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: 5809 'mesos-logrotate-logger
--help=false --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stderr
--logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB '
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: 5804 'mesos-logrotate-logger
--help=false --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stdout
--logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB '
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: 5809 'mesos-logrotate-logger
--help=false --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stderr
--logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB '
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.374567
30993 slave.cpp:5498] Recovering executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6'
of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.420411
30990 status_update_manager.cpp:208] Recovering executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6'
of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.513164
30994 containerizer.cpp:467] Recovering container '24762d43-2134-475e-b724-caa72110497a' for
executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.533478
30988 mem.cpp:602] Started listening for OOM events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.534553
30988 mem.cpp:722] Started listening on low memory pressure events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.535269
30988 mem.cpp:722] Started listening on medium memory pressure events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.536198
30988 mem.cpp:722] Started listening on critical memory pressure events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.579385
30988 docker.cpp:859] Skipping recovery of executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6'
of framework 'a3ad8418-cb77-4705-b353-4b514ceca52c-0000' because it was not launched from
docker containerizer
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.587158
30989 slave.cpp:4527] Sending reconnect request to executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6'
of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000 at executor(1)@10.0.2.74:46154
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.588287
30990 slave.cpp:2838] Re-registering executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6'
of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.589736
30988 disk.cpp:169] Updating the disk resources for container 24762d43-2134-475e-b724-caa72110497a
to cpus(*):0.1; mem(*):32
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.590117
30990 cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.591284
30990 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms
(cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.595403
30992 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.596102
30992 mem.cpp:388] Updated 'memory.limit_in_bytes' to 32MB for container 24762d43-2134-475e-b724-caa72110497a
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
