mesos-issues mailing list archives

From "Anand Mazumdar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-5349) A large number of tasks stuck in Staging state.
Date Mon, 09 May 2016 19:31:12 GMT

     [ https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Anand Mazumdar updated MESOS-5349:
----------------------------------
    Attachment: log_agent_before_zk_disconnect.gz

> A large number of tasks stuck in Staging state.
> -----------------------------------------------
>
>                 Key: MESOS-5349
>                 URL: https://issues.apache.org/jira/browse/MESOS-5349
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.29.0
>            Reporter: Anand Mazumdar
>              Labels: mesosphere
>         Attachments: agent-state.log, log_agent_after_zk_disconnect.gz, log_agent_before_zk_disconnect.gz,
master-state.log, mesos-master.WARNING, mesos-slave.WARNING, staging_tasks.png
>
>
> We saw a weird issue happening on one of our test clusters over the weekend. A large number of tasks from the example {{long running framework}} were stuck in staging. The executor was duly sending status updates for all of the tasks, and the slave successfully received them, as seen from the logs, but for some reason it never got around to checkpointing them.
> From the agent logs, it seems that the agent kept retrying some backlogged status updates for tasks in the 4xxx/6xxx range (task IDs), while the tasks currently present were launched in the 8xxx range.
> The issue resolved itself after a few hours, once the agent re-registered with the master after losing its ZK session.
> Let's walk through the timeline of a particular task, 8142.
> Agent logs before the restart:
> {code}
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.204941  2820 slave.cpp:1522]
Got assigned task 8142 for framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205142  2820 slave.cpp:1641]
Launching task 8142 for framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205656  2820 slave.cpp:1880]
Queuing task '8142' for executor 'default' of framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
(via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.206092  2818 disk.cpp:169]
Updating the disk resources for container f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.101;
mem(*):33
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207093  2816 mem.cpp:353]
Updated 'memory.soft_limit_in_bytes' to 33MB for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207293  2821 cpushare.cpp:389]
Updated 'cpu.shares' to 103 (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208742  2821 cpushare.cpp:411]
Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10100us (cpus 0.101) for container
f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208902  2818 slave.cpp:2032]
Sending queued task '8142' to executor 'default' of framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
(via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210290  2821 http.cpp:188]
HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210357  2821 slave.cpp:3221]
Handling status update TASK_RUNNING (UUID: 85323c4f-e523-495e-9b49-39b0a7792303) for task
8142 of framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213770  2817 http.cpp:188]
HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213882  2817 slave.cpp:3221]
Handling status update TASK_FINISHED (UUID: 285b73e1-7f5a-43e5-8385-7b76e0fbdad4) for task
8142 of framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.214787  2821 disk.cpp:169]
Updating the disk resources for container f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.1;
mem(*):32
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215365  2823 cpushare.cpp:389]
Updated 'cpu.shares' to 102 (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215641  2820 mem.cpp:353]
Updated 'memory.soft_limit_in_bytes' to 32MB for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.216878  2823 cpushare.cpp:411]
Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container
f68f137c-b101-4f9f-8de4-f50eae27e969
> {code}
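> For reference, the cpushare.cpp values above follow directly from the container's cpu allocation. Below is a minimal sketch of that arithmetic (not Mesos source); the 1024-shares-per-CPU weight and the fixed 100ms CFS period are assumptions based on common cgroup defaults:
> {code}
> // Sketch only: derive cpu.shares and cpu.cfs_quota_us from a cpu allocation,
> // assuming 1024 shares per CPU and a 100ms CFS period.
> #include <cstdint>
> #include <iostream>
>
> int main() {
>   const double cpus = 0.101;  // allocation from the log above (0.1 after TASK_FINISHED)
>
>   // cpu.shares: proportional weight of the container.
>   const uint64_t shares = static_cast<uint64_t>(1024 * cpus);  // 103 for 0.101, 102 for 0.1
>
>   // cpu.cfs_quota_us: hard cap of cpus * period.
>   const uint64_t periodUs = 100000;  // assumed 100ms CFS period
>   const uint64_t quotaUs = static_cast<uint64_t>(periodUs * cpus);  // 10100us for 0.101
>
>   std::cout << "cpu.shares = " << shares
>             << ", cpu.cfs_quota_us = " << quotaUs << std::endl;
>   return 0;
> }
> {code}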
> Agent logs for this task upon restart:
> {code}
> May 09 15:22:14 ip-10-10-0-6 mesos-slave[14314]: W0509 15:22:14.083993 14318 state.cpp:606]
Failed to find status updates file '/var/lib/mesos/slave/meta/slaves/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-S1/frameworks/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003/executors/default/runs/f68f137c-b101-4f9f-8de4-f50eae27e969/tasks/8142/task.updates'
> {code}
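> The path in that warning shows the per-task checkpoint layout the agent looks at during recovery. Here is a small illustrative check (hypothetical code, not Mesos source) that mirrors the lookup implied by the warning; for task 8142 the file is missing because its status updates were never checkpointed before the restart:
> {code}
> // Sketch only: rebuild the checkpoint path from the warning above and check
> // whether the 'task.updates' file exists.
> #include <filesystem>
> #include <iostream>
>
> int main() {
>   namespace fs = std::filesystem;
>
>   const fs::path updates =
>       fs::path("/var/lib/mesos/slave/meta/slaves") /
>       "ad2ee74e-24f1-4381-be9a-1af70ba1ced0-S1" / "frameworks" /
>       "ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003" / "executors" / "default" /
>       "runs" / "f68f137c-b101-4f9f-8de4-f50eae27e969" / "tasks" / "8142" /
>       "task.updates";
>
>   if (!fs::exists(updates)) {
>     std::cerr << "Failed to find status updates file '" << updates.string()
>               << "'" << std::endl;
>     return 1;
>   }
>   return 0;
> }
> {code}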
> Things that need to be investigated:
> - Why couldn't the agent get around to handling the status updates from the executor, i.e., even checkpointing them?
> - What made the agent get _so_ backlogged on the status updates, i.e., why did it keep resending the old status updates for the 4xxx/6xxx tasks without getting around to the newer ones? (See the retry sketch below.)
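> Regarding the second question: below is a generic sketch of re-forwarding an unacknowledged status update with exponential backoff, to illustrate the kind of retry behaviour seen in the logs. It is not the Mesos status update manager itself, and the 10s initial interval and 10-minute cap are illustrative assumptions, not values taken from this cluster:
> {code}
> // Sketch only: print the retry schedule for a single status update that
> // never gets acknowledged.
> #include <algorithm>
> #include <chrono>
> #include <iostream>
>
> int main() {
>   using std::chrono::seconds;
>
>   seconds interval(10);    // assumed initial retry interval
>   const seconds cap(600);  // assumed upper bound on the interval
>
>   seconds elapsed(0);
>   for (int attempt = 1; attempt <= 8; ++attempt) {
>     elapsed += interval;
>     std::cout << "retry " << attempt << " after ~" << elapsed.count()
>               << "s without an acknowledgement" << std::endl;
>     interval = std::min(interval * 2, cap);  // back off exponentially, capped
>   }
>   return 0;
> }
> {code}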
> Please find the agent/master logs attached. This is running against Mesos HEAD -> 557cab591f35a6c3d2248d7af7f06cdf99726e92



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
