aurora-issues mailing list archives

From "Stephan Erb (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1335) Thermos should not immediately resort to killing processes
Date Fri, 03 Jul 2015 08:31:04 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612999#comment-14612999
] 

Stephan Erb commented on AURORA-1335:
-------------------------------------

[~wickman], do you have a hunch where the problem might be located? I will probably look into
this bug and a hint from you might speed this up significantly.

> Thermos should not immediately resort to killing processes
> ----------------------------------------------------------
>
>                 Key: AURORA-1335
>                 URL: https://issues.apache.org/jira/browse/AURORA-1335
>             Project: Aurora
>          Issue Type: Bug
>          Components: Executor, Thermos
>            Reporter: Stephan Erb
>
> As a user of Aurora, I would like my processes to be terminated gracefully
so that they have time to flush their buffers and clean up resources such as database
connections.
> In its current form, the executor sends a TERM signal which is immediately followed by
a KILL signal. As an example, see the timings in the following debug log output of a thermos
runner:
> {code}
> D0526 13:20:56.829274 29 ckpt.py:348] Flipping task state from ACTIVE to CLEANING
> D0526 13:20:56.829396 29 runner.py:242] _on_task_transition: TaskStatus(state=5, runner_uid=0,
runner_pid=29, timestamp_ms=1432639256829)
> D0526 13:20:56.829545 29 runner.py:188] Task on_cleaning(TaskStatus(state=5, runner_uid=0,
runner_pid=29, timestamp_ms=1432639256829))
> TaskRunnerHelper.terminate_process(service)
> D0526 13:20:56.832633 29 helper.py:238]    => SIGTERM pid 119
> D0526 13:20:56.832775 29 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
59.9783368111
> D0526 13:20:56.834014 118 process.py:103] [process:  118=service]: child state transition
[/var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.service]
<= RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3, process=u'service',
start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.833447,
fork_time=None), runner_header=None)
> D0526 13:20:56.834566 118 process.py:103] [process:  118=service]: Coordinator exiting.
> D0526 13:20:56.835757 29 runner.py:873] Run loop: Work to be done within 1.0s
> D0526 13:20:56.836005 29 recordio.py:137] /var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.service
has no data (current offset = 177)
> D0526 13:20:56.836102 29 muxer.py:155] select() returning 1 updates:
> D0526 13:20:56.836200 29 muxer.py:157]   = RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3,
process='service', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4,
stop_time=1432639256.833447, fork_time=None), runner_header=None)
> D0526 13:20:56.836282 29 ckpt.py:379] Running state machine for process=service/seq=3
> D0526 13:20:56.836913 29 runner.py:238] _on_process_transition: ProcessStatus(seq=3,
process='service', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4,
stop_time=1432639256.833447, fork_time=None)
> D0526 13:20:56.837102 29 runner.py:156] Process on_killed ProcessStatus(seq=3, process='service',
start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.833447,
fork_time=None)
> D0526 13:20:56.837189 29 helper.py:244] TaskRunnerHelper.kill_process(service)
> D0526 13:20:56.837582 29 helper.py:252]    => SIGKILL coordinator group 118
> D0526 13:20:56.837745 29 helper.py:255]    => SIGKILL coordinator 118
> D0526 13:20:56.838052 29 muxer.py:94] unregistering service
> D0526 13:20:56.838052 29 runner.py:160] Process killed, marking it as a loss.
> D0526 13:20:56.838052 29 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
59.9730448723
> D0526 13:20:56.844118 29 runner.py:873] Run loop: Work to be done within 1.0s
> D0526 13:20:56.894645 64 process.py:103] [process:   64=reverse_proxy]: child state transition
[/var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.reverse_proxy]
<= RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3, process=u'reverse_proxy',
start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.893275,
fork_time=None), runner_header=None)
> D0526 13:20:56.894645 64 process.py:103] [process:   64=reverse_proxy]: Coordinator exiting.
> D0526 13:20:57.849862 29 helper.py:376] Detected terminated process: pid=118, status=9,
rusage=resource.struct_rusage(ru_utime=0.008, ru_stime=0.024, ru_maxrss=19080, ru_ixrss=0,
ru_idrss=0, ru_isrss=0, ru_minflt=2448, ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0,
ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=20, ru_nivcsw=14)
> D0526 13:20:57.850090 29 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
58.9610338211
> D0526 13:20:57.852466 29 runner.py:870] Run loop: No more work to be done in state CLEANING
> D0526 13:20:57.852730 29 ckpt.py:348] Flipping task state from CLEANING to FINALIZING
> {code}
> Expected behavior would be that Thermos only resorts to killing when the application
does not honor the termination request.
> Using the HTTP signals `/quitquitquit` and `/abortabortabort` is not an option because
these requests are unauthenticated, which is an inherent security problem.
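
For illustration, the expected behavior described above (TERM first, KILL only if the process
does not exit in time) could be sketched roughly as follows. This is not the Thermos
implementation: the function name `terminate_gracefully` and the `grace_seconds` parameter
are hypothetical, and the sketch assumes the process is a direct child started through
Python's `subprocess` module on a POSIX system.

```python
import signal
import subprocess


def terminate_gracefully(proc, grace_seconds=10.0):
    """Ask a child process to stop, escalating to SIGKILL only on timeout.

    proc is a subprocess.Popen instance (an assumption of this sketch).
    Returns None if the process had already exited, otherwise the name of
    the last signal that had to be sent.
    """
    if proc.poll() is not None:
        # Already finished, nothing to signal.
        return None

    # Step 1: polite request. The process may flush buffers and close
    # connections in its SIGTERM handler.
    proc.send_signal(signal.SIGTERM)
    try:
        # Step 2: give it up to grace_seconds to exit on its own.
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        # Step 3: the process did not honor the request in time.
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()  # reap the child so it does not linger as a zombie
        return "SIGKILL"
```

The key difference from the log above is the wait between the two signals: SIGKILL is sent
only after the grace period expires, not right after SIGTERM is delivered.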



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
