aurora-issues mailing list archives

From "alexius ludeman (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AURORA-1362) thermos_executor stop responding to commands
Date Fri, 19 Jun 2015 17:05:00 GMT

     [ https://issues.apache.org/jira/browse/AURORA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

alexius ludeman updated AURORA-1362:
------------------------------------
    Description: 
If https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/sandbox.py
raises any exception, thermos_executor continues to run but no longer responds to any
commands.  It is orphaned and continues to consume resources until it is killed manually.

Based on a conversation with Maxim on #aurora, the correct action is likely to catch the exception at
https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/aurora_executor.py#L122
and exit appropriately.

To reproduce, attempt to launch a task as a nonexistent user on the slave, or cause a chmod/chown
failure; either raises CreationError.  Once this occurs, the Aurora UI never shows the task
moving past the STARTING state.  When transient_task_state_timeout is reached, the task state moves
to LOST.  thermos_executor is still running on the slave, and Mesos considers the task
still active with state STARTING.  Unfortunately, GC is unable to clean up the executor because it
does not know about it.  At this point nothing recovers the orphaned thermos_executor
short of killing it by hand.

Sorry, the line numbers will not match due to local changes, but the stack trace should be accurate.
https://gist.github.com/lexinator/ca95b249c7cb25575395
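
The suggested fix (catch the sandbox failure and exit) can be sketched roughly as follows. This is
a minimal, self-contained illustration, not Aurora's actual code: ExecutorSketch, FailingSandbox,
the local CreationError definition, and the "FAILED" state string are all hypothetical stand-ins
chosen for the example.

```python
class CreationError(Exception):
    """Stand-in for the sandbox error named in the report (hypothetical definition)."""


class FailingSandbox:
    """Hypothetical sandbox whose create() always fails, e.g. on a chown error."""

    def create(self):
        raise CreationError("chown failed: no such user")


class ExecutorSketch:
    """Toy executor, not Aurora's real executor; it only models the failure path."""

    def __init__(self, sandbox):
        self.sandbox = sandbox
        self.status_updates = []  # states that would be reported upstream
        self.terminated = False   # set when the executor decides to shut down

    def start(self):
        self.status_updates.append("STARTING")
        try:
            self.sandbox.create()
        except Exception as e:  # broad on purpose: the report says "any exception"
            # Report a terminal state (name assumed) and shut down instead of lingering.
            self.status_updates.append("FAILED: %s" % e)
            self.terminated = True
            return False
        self.status_updates.append("RUNNING")
        return True


executor = ExecutorSketch(FailingSandbox())
started = executor.start()
```

The point of the sketch is only that a sandbox failure ends in a terminal status update plus
shutdown, rather than leaving the process alive in STARTING with nothing able to reap it.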


  was:
If https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/sandbox.py
raises any exception, thermos_executor continues to run but no longer responds to any
commands.

Based on a conversation with Maxim on #aurora, the correct action is likely to catch the exception at
https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/aurora_executor.py#L122
and exit appropriately.

To reproduce, attempt to launch a task as a nonexistent user on the slave, or cause a chmod/chown
failure; either raises CreationError.  Once this occurs, the Aurora UI never shows the task
moving past the STARTING state.  When transient_task_state_timeout is reached, the task state moves
to LOST.  thermos_executor is still running on the slave, and Mesos considers the task
still active with state STARTING.  Unfortunately, GC is unable to clean up the executor because it
does not know about it.  At this point nothing recovers the orphaned thermos_executor
short of killing it by hand.

Sorry, the line numbers will not match due to local changes, but the stack trace should be accurate.
https://gist.github.com/lexinator/ca95b249c7cb25575395



> thermos_executor stop responding to commands
> --------------------------------------------
>
>                 Key: AURORA-1362
>                 URL: https://issues.apache.org/jira/browse/AURORA-1362
>             Project: Aurora
>          Issue Type: Bug
>          Components: Executor
>    Affects Versions: 0.7.0
>            Reporter: alexius ludeman



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
