mesos-issues mailing list archives

From "Alexander Rukletsov (JIRA)" <>
Subject [jira] [Updated] (MESOS-2334) Tasks get stuck in TASK_STAGING after a network decode error
Date Wed, 11 Feb 2015 13:26:11 GMT


Alexander Rukletsov updated MESOS-2334:
    Affects Version/s: 0.21.0

> Tasks get stuck in TASK_STAGING after a network decode error
> ------------------------------------------------------------
>                 Key: MESOS-2334
>                 URL:
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.21.0
>            Reporter: Andreas Raster
> We observed an issue with a test case that uses launchTasks to schedule a large number of
small CommandInfo tasks (shell commands that look like this: "sleep `shuf -i 2-3 -n 1`; echo
foo >> /share/bar") on a cluster. Sometimes a single task that had been launched and set to
TASK_STAGING would never receive a TASK_RUNNING message (or any other message at all). It
would then stay in TASK_STAGING indefinitely, until we killed the framework.
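The symptom described above (a task that is launched but never gets a status update past TASK_STAGING) can be detected on the framework side with a simple timeout. The following is only a minimal sketch under stated assumptions: `StagingWatchdog`, its method names, and the timeout value are hypothetical and are not part of the Mesos API; a real scheduler would call these methods from its own launch and status-update callbacks.

```python
import time


class StagingWatchdog:
    """Hypothetical helper: remembers when each task was launched and
    reports tasks that have not left TASK_STAGING within a timeout."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self.clock = clock          # injectable so the logic can be tested
        self.staged_at = {}         # task_id -> time the task was launched

    def task_launched(self, task_id):
        # Call right after handing the task to the cluster.
        self.staged_at[task_id] = self.clock()

    def status_update(self, task_id, state):
        # Any state other than TASK_STAGING means the task made progress.
        if state != "TASK_STAGING":
            self.staged_at.pop(task_id, None)

    def stuck_tasks(self):
        # Tasks that have been staging for longer than the timeout.
        now = self.clock()
        return sorted(
            task_id
            for task_id, since in self.staged_at.items()
            if now - since > self.timeout_secs
        )
```

A framework could poll `stuck_tasks()` periodically and log or kill the reported tasks instead of waiting indefinitely.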
> We asked in #mesos on freenode about this and got an answer from alexr_:
> [15:56:55] <alexr_> henno: thanks for the slave logs
> [15:57:09] rakete [] has left #mesos
> [15:58:47] <alexr_> henno: it looks from the logs, that the slave successfully
registers the executor and sends the task
> [15:59:07] tillt_ [~Till@] has joined #mesos
> [15:59:30] <alexr_> the executor, for some reason, refuses to start the task, most
probably because of the message decoding error
> So alexr_ suspects that the cause is a network decoding error. I am not entirely sure what
he means by that, and since I was not the one talking to alexr_ on IRC, I cannot post the
exact log section that indicates the decoding error. I will attach the logs that we supplied
to alexr_, which should contain the relevant information.
> The name of the task in question was: 727527fc-a3f3-418d-a44e-ec3bbdd26315
> cat /var/log/mesos/mesos-slave.INFO | grep 727527fc-a3f3-418d-a44e-ec3bbdd26315
> >>
> cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stderr

> >>
> cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stdout
> >>
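The commands above follow a regular pattern: the sandbox files sit under a path assembled from the slave work directory, slave ID, framework ID, executor ID, and run ID, and the slave log is filtered by task name. A minimal sketch of that pattern, assuming the directory layout seen in the paths quoted above; the function names `sandbox_file` and `matching_lines` are hypothetical helpers, not Mesos tooling.

```python
import posixpath


def sandbox_file(work_dir, slave_id, framework_id, executor_id, run_id, name):
    """Build the path to a sandbox file (e.g. stdout or stderr), following
    the layout of the paths quoted in the report."""
    return posixpath.join(
        work_dir, "slaves", slave_id,
        "frameworks", framework_id,
        "executors", executor_id,
        "runs", run_id,
        name,
    )


def matching_lines(lines, needle):
    """Rough equivalent of `grep needle`: keep lines containing the string."""
    return [line for line in lines if needle in line]


# Values copied from the report.
stderr_path = sandbox_file(
    "/tmp/mesos",
    "20141217-133241-2867204268-5050-12776-S1",
    "20150209-153125-2867204268-5050-2553-0025",
    "29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315",
    "d73a76e7-aec6-4760-bd48-86b79df89d52",
    "stderr",
)
```

With a log file opened as `f`, `matching_lines(f, "727527fc-a3f3-418d-a44e-ec3bbdd26315")` returns the lines that mention the task.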
> If any relevant information is still missing, don't hesitate to ask me.

This message was sent by Atlassian JIRA
