aurora-issues mailing list archives

From "Kevin Sweeney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-788) gc executor updates ignored by slave on loaded machines
Date Fri, 03 Oct 2014 00:03:33 GMT

    [ https://issues.apache.org/jira/browse/AURORA-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157479 ]

Kevin Sweeney commented on AURORA-788:
--------------------------------------

https://reviews.apache.org/r/26300/

> gc executor updates ignored by slave on loaded machines
> -------------------------------------------------------
>
>                 Key: AURORA-788
>                 URL: https://issues.apache.org/jira/browse/AURORA-788
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Kevin Sweeney
>            Assignee: Kevin Sweeney
>             Fix For: 0.6.0
>
>
> It looks like the root cause is that the GC executor doesn't sleep long enough for the slave to reliably receive its final status update.
> From production logs at Twitter:
> {noformat}
> mesos-slave.log:
> I1002 23:40:17.372160 10628 slave.cpp:1025] Got assigned task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework 201205082337-0000000003-0000
> I1002 23:40:17.372648 10628 slave.cpp:1135] Launching task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework 201205082337-0000000003-0000
> I1002 23:40:17.373164 10628 slave.cpp:4002] Checkpointing TaskInfo to '/var/lib/mesos/meta/slaves/20140916-223909-2215521290-5050-52027-3819/frameworks/201205082337-0000000003-0000/executors/aurora.gc/runs/f97928c4-4214-439c-8d9f-c8493070b1db/tasks/system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d/task.info'
> I1002 23:40:17.373951 10628 slave.cpp:1270] Sending task 'system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d' to executor 'aurora.gc' of framework 201205082337-0000000003-0000
> I1002 23:40:17.374651 10623 cpushare.cpp:338] Updated 'cpu.shares' to 204 (cpus 0.2) for container f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.374681 10630 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 128MB for container f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.375869 10623 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 20ms (cpus 0.2) for container f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.614163 10629 containerizer.cpp:997] Executor for container 'f97928c4-4214-439c-8d9f-c8493070b1db' has exited
> I1002 23:40:17.614673 10629 containerizer.cpp:882] Destroying container 'f97928c4-4214-439c-8d9f-c8493070b1db'
> I1002 23:40:17.616425 10627 cgroups.cpp:2208] Freezing cgroup /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.618352 10629 cgroups.cpp:1375] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after 1.140736ms
> I1002 23:40:17.619982 10629 cgroups.cpp:2225] Thawing cgroup /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.621541 10629 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after 1.0752ms
> I1002 23:40:17.623524 10613 port_mapping.cpp:1949] Removing IP packet filters with ports [33792,34815] for container with pid 19217
> I1002 23:40:17.626447 10613 port_mapping.cpp:1968] Freed ephemeral ports [33792,34816) for container with pid 19217
> I1002 23:40:17.646090 10613 port_mapping.cpp:2096] Successfully performed cleanup for pid 19217
> I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework 201205082337-0000000003-0000 exited with status 0
> I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 201205082337-0000000003-0000 from @0.0.0.0:0
> W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for unknown container: f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.649251 10607 status_update_manager.cpp:320] Received status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 201205082337-0000000003-0000
> I1002 23:40:17.649842 10607 status_update_manager.hpp:342] Checkpointing UPDATE for status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 201205082337-0000000003-0000
> I1002 23:40:17.656316 10607 status_update_manager.cpp:373] Forwarding status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 201205082337-0000000003-0000 to master@10.44.8.124:5050
> I1002 23:40:17.658787 10626 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 201205082337-0000000003-0000
> I1002 23:40:17.658941 10626 status_update_manager.hpp:342] Checkpointing ACK for status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 201205082337-0000000003-0000
> {noformat}
> Note the race - the slave notices the dead GC executor process about 2ms (23:40:17.646457 vs 23:40:17.648576) before it sees that status update from the executor.
> {noformat}
> I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework 201205082337-0000000003-0000 exited with status 0
> I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 201205082337-0000000003-0000 from @0.0.0.0:0
> W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for unknown container: f97928c4-4214-439c-8d9f-c8493070b1db
> {noformat}
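
For concreteness, the width of that race window can be read straight off the two slave.cpp timestamps quoted above; a quick sanity check in plain Python:

{noformat}
from datetime import datetime

# Timestamps taken verbatim from the mesos-slave.log excerpt above.
exited = datetime.strptime("23:40:17.646457", "%H:%M:%S.%f")
update = datetime.strptime("23:40:17.648576", "%H:%M:%S.%f")

# Prints 0:00:00.002119, i.e. the executor exit is observed ~2.1ms
# before the slave handles the executor's final status update.
print(update - exited)
{noformat}
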
> {noformat}
> gc_executor_main.log:
> I1002 23:35:11.663753 19217 executor_base.py:46] Executor [20140916-223909-2215521290-5050-52027-3819]:   - Runner is dead but under LOST threshold.
> I1002 23:36:18.483422 19217 executor_base.py:46] Executor [20140916-223909-2215521290-5050-52027-3819]: launchTask() got task_id: system-gc-04810a63-e8b1-42a4-9b08-03b099517856
> I1002 23:36:18.483834 19217 executor_base.py:46] Executor [20140916-223909-2215521290-5050-52027-3819]: => Adding system-gc-04810a63-e8b1-42a4-9b08-03b099517856 to GC queue
> I1002 23:40:11.664382 19217 executor_base.py:46] Executor [20140916-223909-2215521290-5050-52027-3819]: Updating system-gc-04810a63-e8b1-42a4-9b08-03b099517856 => FINISHED
> I1002 23:40:11.664607 19217 executor_base.py:46] Executor [20140916-223909-2215521290-5050-52027-3819]:    Reason: Garbage collection skipped - GC executor shutting down
> I1002 23:40:11.664880 19217 executor_base.py:46] Executor [20140916-223909-2215521290-5050-52027-3819]: Sleeping briefly to mitigate https://issues.apache.org/jira/browse/MESOS-243
> I1002 23:40:16.670403 19217 gc_executor_main.py:41] MesosExecutorDriver.run() has finished.
> {noformat}
> Given that the GC executor is already in maintenance mode, and that we already have a sleep in place to mitigate https://issues.apache.org/jira/browse/MESOS-243, I think we can just raise the sleep timeout here.
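
For reference, a minimal sketch of the shutdown path under discussion, assuming the 2014-era Mesos Python executor API (mesos_pb2, ExecutorDriver.sendStatusUpdate, ExecutorDriver.stop); the constant and function names below are hypothetical, not the actual identifiers in the Aurora GC executor:

{noformat}
import time

import mesos_pb2  # assumed 2014-era Mesos Python bindings

# Hypothetical name; the log above shows the current wait is ~5s
# (23:40:11.66 -> 23:40:16.67). The proposal is simply to raise this.
PERSISTENCE_WAIT_SECS = 5.0

def shutdown(driver, pending_task_ids):
    # Flush a terminal update for every task still queued for GC.
    for task_id in pending_task_ids:
        update = mesos_pb2.TaskStatus()
        update.task_id.value = task_id
        update.state = mesos_pb2.TASK_FINISHED
        driver.sendStatusUpdate(update)

    # The executor API gives no signal that the slave has consumed an
    # update; if this process exits first, the slave synthesizes
    # TASK_LOST (the race in the logs above). The sleep exists solely
    # to mitigate MESOS-243, so lengthening it widens the window the
    # slave has on a loaded machine.
    time.sleep(PERSISTENCE_WAIT_SECS)
    driver.stop()
{noformat}

The actual change is on the review linked above (https://reviews.apache.org/r/26300/).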



