aurora-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Amol S Deshmukh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (AURORA-1642) Thermos runner finalization is broken
Date Fri, 18 Mar 2016 23:30:33 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202350#comment-15202350
] 

Amol S Deshmukh edited comment on AURORA-1642 at 3/18/16 11:30 PM:
-------------------------------------------------------------------

It appears that {{PipedSubprocessExecutor.start(...)}} was implemented to always use pipes
in order to support the {{TeeHandler}} (that was also introduced in r/40922) and {{RotatingFileHandler}}.

This is likely because {{subprocess.Popen(...)}} requires file objects or types that implement
{{fileno()}}. This could have been trivially handled without change to how Popen is invoked
by simply delegating {{FileHandler.fileno()}} in the subclasses to the wrapped file objects.
However, it would have not worked for TeeHandler (since there are 2 handles) and RotatingFileHandler
(which changes the file handles when rotation is performed).


was (Author: adeshmukh):
It appears that {{PipedSubprocessExecutor.start(...)}} was implemented to always use pipes
in order to support the {{TeeHandler}} (that was also introduced in r/40922) and {{RotatingFileHandler}}.

This is likely because {{subprocess.Popen(...)}} requires file objects or types that implement
{{fileno()}}. This could have been trivially handled without change to how Popen is invoked
by simply delegating {{FileHandler.fileno()}} in the subclasses to the wrapped file objects.
However, it would have not worked for all cases other than TeeHandler (since there are 2 handles)
and RotatingFileHandler (which changes the file handles when rotation is performed).

> Thermos runner finalization is broken
> -------------------------------------
>
>                 Key: AURORA-1642
>                 URL: https://issues.apache.org/jira/browse/AURORA-1642
>             Project: Aurora
>          Issue Type: Bug
>          Components: Executor
>            Reporter: Maxim Khutornenko
>            Assignee: Amol S Deshmukh
>
> We have noticed thermos runner finalization no longer works after this commit [024bac9dcb8f37e4b31210e3a0a7aea2345a16ab|https://reviews.apache.org/r/40922/]
for tasks with blocking threads. 
> I was able to reproduce it in Vagrant by extending the sleep timeout of the {{hello}}
task and running {{aurora job killall}} immediately after launching it:
> {noformat}
> while true; do
>       echo hello world
>       sleep 600
> {noformat}
> The finalizer never has a chance to run and after 1 minute a task is forcefully aborted:
> {noformat}
> D0316 04:00:35.237905 19362 runner.py:951] Runner issued kill: force:False, preemption_wait:1
mins
> D0316 04:00:35.238183 19362 runner.py:567] Flipping recovery mode off.
> D0316 04:00:35.238308 19362 ckpt.py:348] Flipping task state from ACTIVE to ACTIVE
> D0316 04:00:35.238437 19362 runner.py:242] _on_task_transition: TaskStatus(state=0, runner_uid=0,
runner_pid=19362, timestamp_ms=1458100835238)
> D0316 04:00:35.239079 19362 runner.py:180] Task on_active(TaskStatus(state=0, runner_uid=0,
runner_pid=19362, timestamp_ms=1458100835238))
> D0316 04:00:35.241660 19362 ckpt.py:348] Flipping task state from ACTIVE to CLEANING
> D0316 04:00:35.241765 19362 runner.py:242] _on_task_transition: TaskStatus(state=5, runner_uid=0,
runner_pid=19362, timestamp_ms=1458100835241)
> D0316 04:00:35.249836 19362 runner.py:188] Task on_cleaning(TaskStatus(state=5, runner_uid=0,
runner_pid=19362, timestamp_ms=1458100835241))
> D0316 04:00:35.249953 19362 helper.py:217] TaskRunnerHelper.terminate_process(hello)
> D0316 04:00:35.256520 19362 helper.py:220]    => SIGTERM pid 19368
> D0316 04:00:35.256705 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
59.9812531471
> D0316 04:00:35.262578 19362 runner.py:929] Run loop: Work to be done within 1.0s
> D0316 04:00:36.263881 19362 runner.py:939] Run loop: No updates collected, touching checkpoint.
> D0316 04:00:36.264199 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
58.9737620354
> D0316 04:00:36.264734 19362 runner.py:929] Run loop: Work to be done within 1.0s
> ---<skipped>---
> D0316 04:01:31.397888 19362 runner.py:939] Run loop: No updates collected, touching checkpoint.
> D0316 04:01:31.398144 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
3.83981513977
> D0316 04:01:31.398538 19362 runner.py:929] Run loop: Work to be done within 1.0s
> D0316 04:01:32.400230 19362 runner.py:939] Run loop: No updates collected, touching checkpoint.
> D0316 04:01:32.401125 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
2.8368370533
> D0316 04:01:32.401596 19362 runner.py:929] Run loop: Work to be done within 1.0s
> D0316 04:01:33.404506 19362 runner.py:939] Run loop: No updates collected, touching checkpoint.
> D0316 04:01:33.404815 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
1.83315014839
> D0316 04:01:33.405534 19362 runner.py:929] Run loop: Work to be done within 1.0s
> D0316 04:01:34.406909 19362 runner.py:939] Run loop: No updates collected, touching checkpoint.
> D0316 04:01:34.407223 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
0.830743074417
> D0316 04:01:34.407908 19362 runner.py:929] Run loop: Work to be done within 0.8s
> D0316 04:01:35.415529 19362 runner.py:939] Run loop: No updates collected, touching checkpoint.
> D0316 04:01:35.415683 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
0
> D0316 04:01:35.415740 19362 runner.py:926] Run loop: No more work to be done in state
CLEANING
> D0316 04:01:35.415888 19362 runner.py:903] Forced terminal state: KILLED
> D0316 04:01:35.415936 19362 ckpt.py:348] Flipping task state from CLEANING to KILLED
> D0316 04:01:35.415980 19362 runner.py:242] _on_task_transition: TaskStatus(state=3, runner_uid=0,
runner_pid=19362, timestamp_ms=1458100895415)
> D0316 04:01:35.416937 19362 runner.py:201] Task on_killed(TaskStatus(state=3, runner_uid=0,
runner_pid=19362, timestamp_ms=1458100895415))
> D0316 04:01:35.417393 19362 runner.py:684] _set_process_status(hello <= KILLED, seq=3[auto])
> D0316 04:01:35.417458 19362 ckpt.py:379] Running state machine for process=hello/seq=3
> D0316 04:01:35.417460 19362 runner.py:238] _on_process_transition: ProcessStatus(seq=3,
process=u'hello', start_time=None, coordinator_pid=None, pid=None, return_code=-1, state=4,
stop_time=1458100895.417381, fork_time=None)
> D0316 04:01:35.417853 19362 runner.py:156] Process on_killed ProcessStatus(seq=3, process=u'hello',
start_time=None, coordinator_pid=None, pid=None, return_code=-1, state=4, stop_time=1458100895.417381,
fork_time=None)
> D0316 04:01:35.417921 19362 helper.py:226] TaskRunnerHelper.kill_process(hello)
> D0316 04:01:35.418145 19362 helper.py:234]    => SIGKILL coordinator group 19367
> D0316 04:01:35.418934 19362 helper.py:237]    => SIGKILL coordinator 19367
> D0316 04:01:35.419006 19362 muxer.py:94] unregistering hello
> D0316 04:01:35.419069 19362 runner.py:160] Process killed, marking it as a loss.
> {noformat}
> After reverting the above commit, the task is killed almost instantaneously with the
finalizer exiting properly:
> {noformat}
> D0316 04:03:42.339101 20053 runner.py:939] Runner issued kill: force:False, preemption_wait:1
mins
> D0316 04:03:42.339370 20053 runner.py:564] Flipping recovery mode off.
> D0316 04:03:42.339503 20053 ckpt.py:348] Flipping task state from ACTIVE to ACTIVE
> D0316 04:03:42.339571 20053 runner.py:242] _on_task_transition: TaskStatus(state=0, runner_uid=0,
runner_pid=20053, timestamp_ms=1458101022339)
> D0316 04:03:42.340327 20053 runner.py:180] Task on_active(TaskStatus(state=0, runner_uid=0,
runner_pid=20053, timestamp_ms=1458101022339))
> D0316 04:03:42.343360 20053 ckpt.py:348] Flipping task state from ACTIVE to CLEANING
> D0316 04:03:42.343463 20053 runner.py:242] _on_task_transition: TaskStatus(state=5, runner_uid=0,
runner_pid=20053, timestamp_ms=1458101022343)
> D0316 04:03:42.344290 20053 runner.py:188] Task on_cleaning(TaskStatus(state=5, runner_uid=0,
runner_pid=20053, timestamp_ms=1458101022343))
> D0316 04:03:42.344388 20053 helper.py:217] TaskRunnerHelper.terminate_process(hello)
> D0316 04:03:42.349311 20053 helper.py:220]    => SIGTERM pid 20059
> D0316 04:03:42.349566 20058 process.py:132] [process:20058=hello]: child state transition
[/var/lib/mesos/slaves/cf74b250-4134-4227-8f86-b48a67e15b89-S0/frameworks/cf74b250-4134-4227-8f86-b48a67e15b89-0000/executors/thermos-www-data-prod-hello-0-acb78a97-a551-4e16-ba7e-2c488717556d/runs/0310abe1-3a24-4308-9c5e-6df7714ca2f2/checkpoints/checkpoints/www-data-prod-hello-0-acb78a97-a551-4e16-ba7e-2c488717556d/coordinator.hello]
<= RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3, process=u'hello', start_time=None,
coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1458101022.34949, fork_time=None),
runner_header=None)
> D0316 04:03:42.349777 20053 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
59.9893760681
> D0316 04:03:42.350320 20058 process.py:132] [process:20058=hello]: Coordinator exiting.
> D0316 04:03:42.360085 20053 runner.py:917] Run loop: Work to be done within 1.0s
> D0316 04:03:42.360292 20053 recordio.py:137] /var/lib/mesos/slaves/cf74b250-4134-4227-8f86-b48a67e15b89-S0/frameworks/cf74b250-4134-4227-8f86-b48a67e15b89-0000/executors/thermos-www-data-prod-hello-0-acb78a97-a551-4e16-ba7e-2c488717556d/runs/0310abe1-3a24-4308-9c5e-6df7714ca2f2/checkpoints/checkpoints/www-data-prod-hello-0-acb78a97-a551-4e16-ba7e-2c488717556d/coordinator.hello
has no data (current offset = 171)
> D0316 04:03:42.360383 20053 muxer.py:155] select() returning 1 updates:
> D0316 04:03:42.360440 20053 muxer.py:157]   = RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3,
process='hello', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4,
stop_time=1458101022.34949, fork_time=None), runner_header=None)
> D0316 04:03:42.360486 20053 ckpt.py:379] Running state machine for process=hello/seq=3
> D0316 04:03:42.360543 20053 runner.py:238] _on_process_transition: ProcessStatus(seq=3,
process='hello', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4,
stop_time=1458101022.34949, fork_time=None)
> D0316 04:03:42.361098 20053 runner.py:156] Process on_killed ProcessStatus(seq=3, process='hello',
start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1458101022.34949,
fork_time=None)
> D0316 04:03:42.361164 20053 helper.py:226] TaskRunnerHelper.kill_process(hello)
> D0316 04:03:42.361391 20053 helper.py:234]    => SIGKILL coordinator group 20058
> D0316 04:03:42.361438 20053 helper.py:237]    => SIGKILL coordinator 20058
> D0316 04:03:42.361488 20053 muxer.py:94] unregistering hello
> D0316 04:03:42.361536 20053 runner.py:160] Process killed, marking it as a loss.
> D0316 04:03:42.361555 20053 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
59.9775869846
> D0316 04:03:42.361555 20053 runner.py:917] Run loop: Work to be done within 1.0s
> D0316 04:03:42.361555 20053 runner.py:925] Update collection only took 0.0s, idling 1.0s
> D0316 04:03:43.361848 20053 runner.py:927] Run loop: No updates collected, touching checkpoint.
> D0316 04:03:43.362040 20053 helper.py:358] Detected terminated process: pid=20058, status=9,
rusage=resource.struct_rusage(ru_utime=0.00684, ru_stime=0.008775999999999999, ru_maxrss=16288,
ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=4105, ru_majflt=0, ru_nswap=0, ru_inblock=0,
ru_oublock=48, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=9, ru_nivcsw=14)
> D0316 04:03:43.362119 20053 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining:
58.9770309925
> D0316 04:03:43.362361 20053 runner.py:914] Run loop: No more work to be done in state
CLEANING
> D0316 04:03:43.362416 20053 ckpt.py:348] Flipping task state from CLEANING to FINALIZING
> D0316 04:03:43.362462 20053 runner.py:242] _on_task_transition: TaskStatus(state=6, runner_uid=0,
runner_pid=20053, timestamp_ms=1458101023362)
> D0316 04:03:43.363035 20053 runner.py:193] Task on_finalizing(TaskStatus(state=6, runner_uid=0,
runner_pid=20053, timestamp_ms=1458101023362))
> D0316 04:03:43.363317 20053 helper.py:226] TaskRunnerHelper.kill_process(hello)
> D0316 04:03:43.363497 20053 helper.py:234]    => SIGKILL coordinator group 20058
> D0316 04:03:43.363550 20053 runner.py:801] Schedule pass:
> D0316 04:03:43.363619 20053 runner.py:804] running: 
> D0316 04:03:43.363662 20053 runner.py:805] finished: 
> D0316 04:03:43.363723 20053 runner.py:815] runnable: 
> D0316 04:03:43.363761 20053 runner.py:817] waiting: 
> D0316 04:03:43.363961 20053 runner.py:348] TaskRunnerStage[FINALIZING]: Finalization
remaining: 58.9751880169
> D0316 04:03:43.364037 20053 runner.py:753] running:0 runnable:0 waiting:0 complete:True
> D0316 04:03:43.364084 20053 runner.py:914] Run loop: No more work to be done in state
FINALIZING
> D0316 04:03:43.364084 20053 runner.py:891] Forced terminal state: KILLED
> D0316 04:03:43.364084 20053 ckpt.py:348] Flipping task state from FINALIZING to KILLED
> D0316 04:03:43.364084 20053 runner.py:242] _on_task_transition: TaskStatus(state=3, runner_uid=0,
runner_pid=20053, timestamp_ms=1458101023364)
> D0316 04:03:43.364553 20053 runner.py:201] Task on_killed(TaskStatus(state=3, runner_uid=0,
runner_pid=20053, timestamp_ms=1458101023364))
> D0316 04:03:43.364784 20053 helper.py:226] TaskRunnerHelper.kill_process(hello)
> D0316 04:03:43.364965 20053 helper.py:234]    => SIGKILL coordinator group 20058 
> {noformat}
> Root cause still TBD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message