aurora-issues mailing list archives

From "Kevin Sweeney (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AURORA-788) GC executor doesn't need to exit after period of inactivity
Date Wed, 08 Oct 2014 18:54:33 GMT

     [ https://issues.apache.org/jira/browse/AURORA-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Kevin Sweeney updated AURORA-788:
---------------------------------
    Description: The GC executor is currently configured to exit after a hardcoded period of
inactivity since it last processed a launchTask request. This causes needless task churn in the
cluster, especially since the scheduler's default launch interval means the executor has to be
relaunched roughly every 15 minutes on average.  (was: It looks like the root cause is that
the GC executor doesn't sleep long enough for the slave to reliably receive its final status
update

From production logs at Twitter:
{noformat}
mesos-slave.log:
I1002 23:40:17.372160 10628 slave.cpp:1025] Got assigned task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d
for framework 201205082337-0000000003-0000
I1002 23:40:17.372648 10628 slave.cpp:1135] Launching task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d
for framework 201205082337-0000000003-0000
I1002 23:40:17.373164 10628 slave.cpp:4002] Checkpointing TaskInfo to '/var/lib/mesos/meta/slaves/20140916-223909-2215521290-5050-52027-3819/frameworks/201205082337-0000000003-0000/executors/aurora.gc/runs/f97928c4-4214-439c-8d9f-c8493070b1db/tasks/system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d/task.info'
I1002 23:40:17.373951 10628 slave.cpp:1270] Sending task 'system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d'
to executor 'aurora.gc' of framework 201205082337-0000000003-0000
I1002 23:40:17.374651 10623 cpushare.cpp:338] Updated 'cpu.shares' to 204 (cpus 0.2) for container
f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.374681 10630 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 128MB for
container f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.375869 10623 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us'
to 20ms (cpus 0.2) for container f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.614163 10629 containerizer.cpp:997] Executor for container 'f97928c4-4214-439c-8d9f-c8493070b1db'
has exited
I1002 23:40:17.614673 10629 containerizer.cpp:882] Destroying container 'f97928c4-4214-439c-8d9f-c8493070b1db'
I1002 23:40:17.616425 10627 cgroups.cpp:2208] Freezing cgroup /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.618352 10629 cgroups.cpp:1375] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
after 1.140736ms
I1002 23:40:17.619982 10629 cgroups.cpp:2225] Thawing cgroup /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.621541 10629 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
after 1.0752ms
I1002 23:40:17.623524 10613 port_mapping.cpp:1949] Removing IP packet filters with ports [33792,34815]
for container with pid 19217
I1002 23:40:17.626447 10613 port_mapping.cpp:1968] Freed ephemeral ports [33792,34816) for
container with pid 19217
I1002 23:40:17.646090 10613 port_mapping.cpp:2096] Successfully performed cleanup for pid
19217
I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework 201205082337-0000000003-0000
exited with status 0
I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f)
for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 201205082337-0000000003-0000
from @0.0.0.0:0
W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for unknown container:
f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.649251 10607 status_update_manager.cpp:320] Received status update TASK_LOST
(UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d
of framework 201205082337-0000000003-0000
I1002 23:40:17.649842 10607 status_update_manager.hpp:342] Checkpointing UPDATE for status
update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d
of framework 201205082337-0000000003-0000
I1002 23:40:17.656316 10607 status_update_manager.cpp:373] Forwarding status update TASK_LOST
(UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d
of framework 201205082337-0000000003-0000 to master@10.44.8.124:5050
I1002 23:40:17.658787 10626 status_update_manager.cpp:398] Received status update acknowledgement
(UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d
of framework 201205082337-0000000003-0000
I1002 23:40:17.658941 10626 status_update_manager.hpp:342] Checkpointing ACK for status update
TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d
of framework 201205082337-0000000003-0000
{noformat}

Note the race - the slave notices the dead GC executor process 300µs before it sees that
status update from the executor.
{noformat}
I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework 201205082337-0000000003-0000
exited with status 0
I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f)
for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 201205082337-0000000003-0000
from @0.0.0.0:0
W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for unknown container:
f97928c4-4214-439c-8d9f-c8493070b1db
{noformat}

{noformat}
gc_executor_main.log:
I1002 23:35:11.663753 19217 executor_base.py:46] Executor [20140916-223909-2215521290-5050-52027-3819]:
  - Runner is dead but under LOST threshold.
I1002 23:36:18.483422 19217 executor_base.py:46] Executor [20140916-223909-2215521290-5050-52027-3819]:
launchTask() got task_id: system-gc-04810a63-e8b1-42a4-9b08-03b099517856
I1002 23:36:18.483834 19217 executor_base.py:46] Executor [20140916-223909-2215521290-5050-52027-3819]:
=> Adding system-gc-04810a63-e8b1-42a4-9b08-03b099517856 to GC queue
I1002 23:40:11.664382 19217 executor_base.py:46] Executor [20140916-223909-2215521290-5050-52027-3819]:
Updating system-gc-04810a63-e8b1-42a4-9b08-03b099517856 => FINISHED
I1002 23:40:11.664607 19217 executor_base.py:46] Executor [20140916-223909-2215521290-5050-52027-3819]:
   Reason: Garbage collection skipped - GC executor shutting down
I1002 23:40:11.664880 19217 executor_base.py:46] Executor [20140916-223909-2215521290-5050-52027-3819]:
Sleeping briefly to mitigate https://issues.apache.org/jira/browse/MESOS-243
I1002 23:40:16.670403 19217 gc_executor_main.py:41] MesosExecutorDriver.run() has finished.
{noformat}

Given that the GC executor is already in maintenance mode, and we already have a sleep in place
to mitigate https://issues.apache.org/jira/browse/MESOS-243, I think we can just raise the
sleep timeout here.)
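
The superseded proposal above amounts to widening the window between the executor's final
sendStatusUpdate call and stopping the driver. A minimal sketch of that shutdown ordering,
assuming hypothetical names (FINAL_UPDATE_WAIT_SECS, send_pending_updates) and an illustrative
timeout value rather than the actual gc_executor code:
{noformat}
# Hypothetical sketch, not the real Aurora GC executor shutdown path.
import time

# The gc_executor_main.log above shows roughly a 5-second sleep today; raising it
# gives the slave more time to receive and checkpoint the final status updates
# (MESOS-243) before it notices the executor process has exited.
FINAL_UPDATE_WAIT_SECS = 30  # illustrative value only

def shut_down(driver, send_pending_updates):
    """Send remaining terminal updates, wait, then stop the executor driver."""
    send_pending_updates()               # e.g. TASK_FINISHED for queued GC tasks
    time.sleep(FINAL_UPDATE_WAIT_SECS)   # let the slave process the updates first
    driver.stop()                        # ends MesosExecutorDriver.run()
{noformat}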

> GC executor doesn't need to exit after period of inactivity
> -----------------------------------------------------------
>
>                 Key: AURORA-788
>                 URL: https://issues.apache.org/jira/browse/AURORA-788
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Kevin Sweeney
>            Assignee: Kevin Sweeney
>             Fix For: 0.6.0
>
>
> The GC executor is currently configured to exit after a hardcoded period of inactivity since
> it last processed a launchTask request. This causes needless task churn in the cluster,
> especially since the scheduler's default launch interval means the executor has to be
> relaunched roughly every 15 minutes on average.
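
For context, the change implied by the summary is to keep the GC executor resident between
launchTask calls rather than stopping the driver after a fixed idle period. A rough sketch of
the two behaviors, assuming hypothetical names (gc_queue, process_gc_task, IDLE_EXIT_SECS) that
are not the actual gc_executor implementation:
{noformat}
# Hypothetical sketch, not the actual Aurora gc_executor code.
import queue
import time

IDLE_EXIT_SECS = 15 * 60   # illustrative stand-in for the hardcoded inactivity limit
POLL_SECS = 10

def run(driver, gc_queue, process_gc_task, exit_when_idle=False):
    """Drain GC tasks as they arrive; optionally exit after a period of inactivity."""
    last_activity = time.time()
    while True:
        try:
            task = gc_queue.get(timeout=POLL_SECS)
        except queue.Empty:
            # Old behavior: stop the driver once the idle limit is exceeded, forcing
            # the scheduler to relaunch the executor for its next GC pass.
            if exit_when_idle and time.time() - last_activity > IDLE_EXIT_SECS:
                driver.stop()
                return
            continue
        process_gc_task(task)        # run garbage collection for this launchTask
        last_activity = time.time()
        # With exit_when_idle=False the executor simply stays resident and waits
        # for the next launchTask, avoiding the relaunch churn described above.
{noformat}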



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
