aurora-issues mailing list archives

From "Zameer Manji (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1789) namespaces/pid isolator causes lost process
Date Thu, 06 Oct 2016 23:36:20 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553561#comment-15553561 ]

Zameer Manji commented on AURORA-1789:
--------------------------------------

Here is my initial read, based on a first pass through these logs and the code. I'm focusing
on {{apache/thermos/core/process.py}}.

The coordinator appears to have been forked and to be running. Once forked, it is supposed to
execute the target process and then write a checkpoint recording that the process is RUNNING,
along with its pid. That never happens here.
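
To make the expected sequence concrete, here is a minimal sketch of it. This is not the code in {{process.py}}: the function names, the JSON checkpoint format, and the file layout are all hypothetical and only illustrate the order of events (fork the coordinator, start the target, then checkpoint RUNNING with the pid).

```python
import json
import os
import subprocess
import sys
import tempfile
import time


def run_coordinator(cmdline, checkpoint_path):
    """Hypothetical coordinator body: start the target, then checkpoint RUNNING."""
    # Step 1: execute the target process.
    proc = subprocess.Popen(cmdline)
    # Step 2: record that the process is RUNNING, together with its pid.
    record = {"state": "RUNNING", "pid": proc.pid, "start_time": time.time()}
    with open(checkpoint_path, "a") as fp:
        fp.write(json.dumps(record) + "\n")
        fp.flush()
        os.fsync(fp.fileno())
    # Step 3: wait for the target to finish and hand back its return code.
    return proc.wait()


def fork_coordinator(cmdline, checkpoint_path):
    """Fork a coordinator; the parent (the runner) gets the coordinator pid."""
    pid = os.fork()
    if pid == 0:
        # Child: act as the coordinator and never return into the parent's code.
        rc = 1
        try:
            rc = run_coordinator(cmdline, checkpoint_path)
        except Exception:
            rc = 1
        finally:
            os._exit(rc if 0 <= rc <= 255 else 1)
    return pid


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint")
    coordinator_pid = fork_coordinator([sys.executable, "-c", "pass"], ckpt)
    os.waitpid(coordinator_pid, 0)
    with open(ckpt) as fp:
        print(fp.read().strip())
```

In this sketch, if step 1 fails or blocks, the RUNNING record is never written and the parent only ever knows the coordinator pid. That would be consistent with the {{coordinator_pid=...}}, {{pid=None}} entries in the quoted log, though it does not by itself show where the real code gets stuck.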

This task is also running under the unified containerizer and has a task filesystem. I suspect
there is a bug inside the {{execute()}} method, perhaps in the path that launches the target
process via the mesos-containerizer.



> namespaces/pid isolator causes lost process
> -------------------------------------------
>
>                 Key: AURORA-1789
>                 URL: https://issues.apache.org/jira/browse/AURORA-1789
>             Project: Aurora
>          Issue Type: Bug
>          Components: Executor
>    Affects Versions: 0.16.0
>            Reporter: Justin Pinkul
>            Assignee: Zameer Manji
>
> When using the Mesos containerizer with the namespaces/pid isolator and a Docker image, the
> Thermos executor is unable to launch processes. The executor forks the process and is then
> unable to locate it after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an abnormal termination
> {code}
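
The pattern in the log above (fork, then a LOST detection roughly a minute later with {{pid=None}}) can be expressed as a simple check. The sketch below is illustrative only: the {{is_lost}} helper and the 60-second timeout are assumptions inferred from the log timestamps, not values taken from the runner source, and the tuple carries only a subset of the fields printed in the log.

```python
from collections import namedtuple

# Subset of the ProcessStatus fields that appear in the log excerpt.
ProcessStatus = namedtuple(
    "ProcessStatus", ["process", "coordinator_pid", "pid", "fork_time"]
)

# Assumption: the log shows about 60 s between "Forking" and "Detected a LOST task".
LOST_TIMEOUT_SECS = 60.0


def is_lost(status, now, timeout=LOST_TIMEOUT_SECS):
    """Return True if a forked process never reported a pid within the timeout."""
    if status.fork_time is None or status.pid is not None:
        return False
    return (now - status.fork_time) >= timeout


# First entry from the log: forked at 21:36:22 GMT, still pid=None at 21:37:22 GMT.
first = ProcessStatus(
    process="BigBrother start",
    coordinator_pid=1144,
    pid=None,
    fork_time=1475789782.842882,
)
detected_at = 1475789842.929864  # 21:37:22.929864 GMT as epoch seconds

print(is_lost(first, detected_at))
```

Under these assumptions the check returns True for every cycle in the log, since each coordinator is forked but no pid is ever recorded for the target process.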



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
