aurora-reviews mailing list archives

From Stephan Erb <s...@apache.org>
Subject Review Request 63443: Terminate the executor on unhandled errors
Date Tue, 31 Oct 2017 16:17:25 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63443/
-----------------------------------------------------------

Review request for Aurora, Bill Farner and Zameer Manji.


Bugs: AURORA-1955
    https://issues.apache.org/jira/browse/AURORA-1955


Repository: aurora


Description
-------

This commit consists of two independent parts:

a) ensure we interrupt the main thread when there are unhandled exceptions
b) ensure the main thread of the executor can be interrupted
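
To illustrate part a), here is a minimal sketch of an exception hook that
interrupts the main thread. It is not the content of the new `excepthook.py`
in this diff. The name `install_interrupting_excepthook` and the `interrupt`
parameter are hypothetical, and the sketch targets Python 3, where the module
is called `_thread` (it is `thread` on the Python 2.7 seen in the logs below).
Interrupting before chaining to the former hook is an assumption, chosen so
that a failing former hook cannot prevent the interrupt.

```python
import sys
import threading
import _thread  # named `thread` on Python 2


def install_interrupting_excepthook(interrupt=_thread.interrupt_main):
    """Install a sys.excepthook that interrupts the main thread.

    `interrupt` is a hypothetical injection point; by default it raises
    KeyboardInterrupt in the main thread via _thread.interrupt_main().
    Returns the previously installed hook so that it can be restored.
    """
    former_hook = sys.excepthook

    def hook(exc_type, value, trace):
        current = threading.current_thread()
        # Only a non-main thread needs to wake up the main thread.
        if current is not threading.main_thread():
            sys.stderr.write(
                "Unhandled error in %r. Interrupting main thread.\n" % current)
            # Interrupt first, so a broken former hook cannot block it.
            interrupt()
        # Chain to whatever hook was installed before this one.
        former_hook(exc_type, value, trace)

    sys.excepthook = hook
    return former_hook
```

The hook only runs if the failing thread forwards its exception to
`sys.excepthook`, as the `_excepting_run` wrapper in the traceback below does.
The interrupt also only takes effect once the main thread is executing Python
code again, so a main thread that blocks indefinitely may never see it. That
is presumably what part b) addresses.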


Diffs
-----

  src/main/python/apache/aurora/executor/bin/thermos_executor_main.py a191cf9eec844035c0f6aa5aed3731a06024c0df

  src/main/python/apache/aurora/tools/thermos.py de20c06cea5bbb45c7a6f5acfeee69289f8e6ad8

  src/main/python/apache/aurora/tools/thermos_observer.py 0318f990ac003c0b8925b7eb7359431cdee34f05

  src/main/python/apache/thermos/common/excepthook.py PRE-CREATION 
  src/main/python/apache/thermos/runner/thermos_runner.py 847f51ed2c0e003f1325aa903bd0f0b760acb365



Diff: https://reviews.apache.org/r/63443/diff/1/


Testing
-------

This bug is pretty hard to reproduce and test. I therefore opted for manual
verification and injected a raised exception shortly before the last statement
of the `AuroraExecutor._shutdown` method. Without this patch, this resulted in
hanging executors on the host. With this patch, everything is terminated as
expected.

For details of the successful run, please see the executor logs below. Please
note that the `apport.fileutils` ImportError is due to Ubuntu messing with its
Python installation. This is not critical.

```
twitter.common.app debug: Initializing: apache.thermos.common.excepthook (Exception termination
handler.)
I1031 15:59:37.188621 25437 exec.cpp:162] Version: 1.2.0
I1031 15:59:37.192201 25429 exec.cpp:237] Executor registered on agent 93259518-14f4-4956-a39c-aa615bff9a5e-S0
Writing log files to disk in /var/lib/mesos/slaves/93259518-14f4-4956-a39c-aa615bff9a5e-S0/frameworks/7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000/executors/thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c/runs/54a5ed51-aa9b-476f-9f75-0b42bd6dfa8d

ERROR] Unhandled error in <StatusManager(Thread-7 [TID=25450], started daemon 139968452134656)>.
Interrupting main thread.
Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/aurora/executor/status_manager.py", line 62, in run
  File "apache/aurora/executor/aurora_executor.py", line 236, in _shutdown
RuntimeError: Woops!
Exception in thread Thread-7 [TID=25450]:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/root/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py",
line 115, in identified
    return instancemethod(self, *args, **kwargs)
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
line 130, in _excepting_run
    sys.excepthook(*sys.exc_info())
  File "apache/thermos/common/excepthook.py", line 41, in teardown_handler
    self._former_hook()(exc_type, value, trace)
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
ImportError: No module named apport.fileutils

twitter.common.app debug: main exited with ^C
twitter.common.app debug: Shutting application down.
twitter.common.app debug: Running exit function for apache.thermos.common.excepthook (Exception
termination handler.)
twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
twitter.common.app debug: Finishing up module teardown.
twitter.common.app debug:   Active thread: <_MainThread(MainThread, started 139968622749504)>
twitter.common.app debug:   Active thread (daemon): <TaskResourceMonitor(TaskResourceMonitor[www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c]
[TID=25449], started daemon 139967951009536)>
twitter.common.app debug:   Active thread (daemon): <_DummyThread(Dummy-13, started daemon
139968485705472)>
twitter.common.app debug:   Active thread (daemon): <WaitThread(Thread-9, started daemon
139967934224128)>
twitter.common.app debug:   Active thread (daemon): <WaitThread(Thread-12, started daemon
139967942616832)>
twitter.common.app debug:   Active thread (daemon): <_DummyThread(Dummy-3, started daemon
139968510883584)>
twitter.common.app debug:   Active thread (daemon): <WaitThread(Thread-11, started daemon
139967925831424)>
twitter.common.app debug: Exiting cleanly.
```

Corresponding agent logs, indicating that Mesos knows about the crash on teardown:
```
I1031 15:59:54.692739  1956 slave.cpp:4769] Executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c'
of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 exited with status 130
I1031 15:59:54.692834  1956 slave.cpp:4869] Cleaning up executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c'
of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 at executor(1)@192.168.33.7:48931
I1031 15:59:54.692996  1956 slave.cpp:4957] Cleaning up framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000
```


Thanks,

Stephan Erb

