aurora-issues mailing list archives

From "Zameer Manji (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1801) TaskObserver thread stops refreshing after filesystem race condition
Date Wed, 26 Oct 2016 21:30:58 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609736#comment-15609736 ]

Zameer Manji commented on AURORA-1801:
--------------------------------------

I am a big fan of making the process fail if the `TaskObserver` thread fails.

That matches up with patterns elsewhere in the code.
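
A minimal sketch of that fail-fast behaviour, written for Python 3 and not taken from the Aurora code base. `FailFastThread` and its `on_fatal` parameter are hypothetical names used only for illustration. It uses `os._exit` because `sys.exit` called from a non-main thread only ends that thread, not the process.

```python
import os
import sys
import threading
import traceback


class FailFastThread(threading.Thread):
    """Hypothetical helper: a thread that takes the process down if its body raises."""

    def __init__(self, target, on_fatal=None):
        super().__init__()
        self.daemon = True
        self._body = target
        # on_fatal is injectable so the behaviour can be exercised without exiting.
        # The default exits the whole process with a non-zero status.
        self._on_fatal = on_fatal if on_fatal is not None else (lambda: os._exit(1))

    def run(self):
        try:
            self._body()
        except Exception:
            # Log the failure first, then make sure the process does not linger
            # in a half-working state with a dead refresh thread.
            traceback.print_exc(file=sys.stderr)
            self._on_fatal()
```

With a supervisor restarting the observer, a crash like the one in this ticket would then turn into an automatic restart instead of a silently stale task list.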

We can also prevent the race condition.
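
One possible way to tolerate the race, sketched under assumptions: the traceback shows `os.readlink` raising `ENOENT` while a path is being resolved, so entries that vanish between listing and resolution could simply be skipped. `resolve_existing` and its `resolve` parameter are hypothetical and are not the actual `path_detector.py` code.

```python
import errno
import os


def resolve_existing(paths, resolve=os.path.realpath):
    """Yield resolved paths, skipping entries that disappear mid-iteration."""
    for path in paths:
        try:
            yield resolve(path)
        except OSError as e:
            # Only swallow "No such file or directory"; anything else is a
            # real problem and should still propagate.
            if e.errno != errno.ENOENT:
                raise
```

A sandbox directory removed while the detector is iterating would then just be absent from the result of that refresh.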

> TaskObserver thread stops refreshing after filesystem race condition
> --------------------------------------------------------------------
>
>                 Key: AURORA-1801
>                 URL: https://issues.apache.org/jira/browse/AURORA-1801
>             Project: Aurora
>          Issue Type: Bug
>          Components: Observer
>            Reporter: Stephan Erb
>
> It seems that a race condition accessing the Mesos filesystem layout can bubble up and terminate the {{TaskObserver}} thread responsible for refreshing the internal data structure of available tasks. Restarting the observer fixes the problem.
> Exception triggering the issue:
> {code}
> Traceback (most recent call last):
>   File "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.bce9e54ac7cded79a75603fb4e6bcef2c7d1e6bc/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>     self.__real_run(*args, **kw)
>   File "apache/thermos/observer/task_observer.py", line 135, in run
>   File "apache/thermos/observer/detector.py", line 74, in refresh
>   File "apache/thermos/observer/detector.py", line 58, in _refresh_detectors
>   File "apache/aurora/executor/common/path_detector.py", line 34, in get_paths
>   File "apache/aurora/executor/common/path_detector.py", line 34, in <genexpr>
>   File "apache/aurora/executor/common/path_detector.py", line 33, in iterate
>   File "/usr/lib/python2.7/posixpath.py", line 376, in realpath
>     resolved = _resolve_link(component)
>   File "/usr/lib/python2.7/posixpath.py", line 399, in _resolve_link
>     resolved = os.readlink(path)
> OSError: [Errno 2] No such file or directory: '/var/lib/mesos/slaves/0768bcb3-205d-4409-a726-3001ad3ef902-S10/frameworks/20151001-085346-58917130-5050-37976-0000/executors/thermos-role-env-myname-0-f9fe0318-d39f-49d3-bdf8-e954d5879b33/runs/latest'
> {code}
> Solution space:
> * terminate the observer process if the {{TaskObserver}} thread fails
> * prevent unknown exceptions from aborting the {{TaskObserver}} run loop
> * prevent the observed race condition in {{detector.py}} or {{path_detector.py}}
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
