aurora-issues mailing list archives

From "brian wickman (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AURORA-176) more gracefully handle cases where user does not exist on machine
Date Mon, 03 Feb 2014 23:51:07 GMT
brian wickman created AURORA-176:
------------------------------------

             Summary: more gracefully handle cases where user does not exist on machine
                 Key: AURORA-176
                 URL: https://issues.apache.org/jira/browse/AURORA-176
             Project: Aurora
          Issue Type: Task
          Components: Thermos
            Reporter: brian wickman
            Priority: Minor


Tasks are going LOST for this reason:

{noformat}
Initialization of task runner failed: Could not construct sandbox: 'getpwnam(): name not found:
zmanji'
{noformat}

I think everything is actually behaving correctly but this should probably be propagated as
a task FAILED rather than LOST.

From Jon B:

As for the specific case described here (failure during task runner initialization), this
is actually done; it now results in FAILED:

{noformat}
2 mins ago - FAILED : Initialization of task runner failed: Could not construct sandbox: 'getpwnam():
name not found: cosmingheorghe'
{noformat}

However, a task can still go LOST when the user no longer exists and the runner terminates
abnormally (after the sandbox is set up and the task is already running):

{noformat}
D1024 21:17:11.360357 52726 runner.py:748] runnable: application
D1024 21:17:11.360486 52726 runner.py:750] waiting: 
D1024 21:17:11.361073 52726 runner.py:647] _set_process_status(application <= WAITING,
seq=4[auto])
D1024 21:17:11.361279 52726 ckpt.py:367] Running state machine for process=application/seq=4
D1024 21:17:11.361418 52726 runner.py:225] _on_process_transition: ProcessStatus(seq=4, process=u'application',
start_time=None, coordinator_pid=None, pid=None, return_code=None, state=0, stop_time=None,
fork_time=None)
D1024 21:17:11.428257 52726 runner.py:90] Process on_waiting ProcessStatus(seq=4, process=u'application',
start_time=None, coordinator_pid=None, pid=None, return_code=None, state=0, stop_time=None,
fork_time=None)
E1024 21:17:11.429981 52726 runner.py:545] Caught exception in self.control(): Unable to get
pwent information!
E1024 21:17:11.432827 52726 runner.py:546]   Traceback (most recent call last):
  File "twitter/thermos/runner/runner.py", line 543, in control
    yield
  File "twitter/thermos/runner/runner.py", line 831, in run
    self._run()
  File "twitter/thermos/runner/runner.py", line 839, in _run
    iteration_wait = runner.run()
  File "twitter/thermos/runner/runner.py", line 278, in run
    launched = self.runner._run_plan(self.runner._regular_plan)
  File "twitter/thermos/runner/runner.py", line 764, in _run_plan
    self._set_process_status(process_name, ProcessState.WAITING)
  File "twitter/thermos/runner/runner.py", line 650, in _set_process_status
    self._dispatcher.dispatch(self._state, runner_ckpt, self._recovery)
  File "twitter/thermos/base/ckpt.py", line 372, in dispatch
    self._run_process_dispatch(process_update.state, process_update)
  File "twitter/thermos/base/ckpt.py", line 200, in _run_process_dispatch
    getattr(handler, handler_function)(process_update)
  File "twitter/thermos/runner/runner.py", line 93, in on_waiting
    process_update.process, process_update.seq + 1))
  File "twitter/thermos/runner/runner.py", line 675, in _task_process_from_process_name
    fork=close_ckpt_and_fork)
  File "twitter/thermos/runner/process.py", line 287, in __init__
    ProcessBase.__init__(self, *args, **kw)
  File "twitter/thermos/runner/process.py", line 93, in __init__
    user, current_user = self._getpwuid() # may raise self.UnknownUserError
  File "twitter/thermos/runner/process.py", line 214, in _getpwuid
    raise self.UnknownUserError('Unable to get pwent information!')
UnknownUserError: Unable to get pwent information!
{noformat}

and then the executor marks it as LOST:

{noformat}
I1024 21:17:11.947812 52673 status_manager.py:69] Executor polling thread detected termination
condition.
I1024 21:17:11.948065 52673 task_runner_wrapper.py:184] Runner is dead, skipping kill.
I1024 21:18:11.114897 52673 status_manager.py:109] Waiting for terminal state, current state:
ACTIVE
...
I1024 21:18:11.616302 52673 status_manager.py:109] Waiting for terminal state, current state:
ACTIVE
I1024 21:18:12.117630 52673 status_manager.py:125] State we've accepted: Thermos(ACTIVE) /
Failure: None
E1024 21:18:12.117769 52673 status_manager.py:129] Runner is dead but task state unexpectedly
ACTIVE!
...
D1024 21:18:12.664916 52673 ckpt.py:336] Flipping task state from FINALIZING to KILLED
D1024 21:18:12.665034 52673 runner.py:229] _on_task_transition: TaskStatus(state=3, runner_uid=0,
runner_pid=52673, timestamp_ms=1382649492664)
D1024 21:18:12.737503 52673 runner.py:188] Task on_killed(TaskStatus(state=3, runner_uid=0,
runner_pid=52673, timestamp_ms=1382649492664))
I1024 21:18:12.737726 52673 helper.py:125]   Coordinator stage_twemcache [pid: 52758] completed.
I1024 21:18:12.737891 52673 helper.py:136]   Process stage_twemcache [pid: 52759] completed.
I1024 21:18:12.738014 52673 runner.py:903] Transitioning application to LOST
D1024 21:18:12.738142 52673 helper.py:204] TaskRunnerHelper.kill_process(stage_twemcache)
I1024 21:18:12.738306 52673 helper.py:125]   Coordinator stage_twemcache [pid: 52758] completed.
I1024 21:18:12.738466 52673 helper.py:136]   Process stage_twemcache [pid: 52759] completed.
D1024 21:18:12.738584 52673 helper.py:212]    => SIGKILL coordinator group 52758
I1024 21:18:12.738967 52673 status_manager.py:150] Sending terminal state update: TASK_LOST
{noformat}
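
To illustrate the behaviour being asked for, here is a minimal sketch, not the actual Thermos code. It assumes the runner could classify a missing-user error separately from other abnormal terminations, so that it is reported as FAILED rather than LOST. The names lookup_user and terminal_state_for, and the local UnknownUserError class, are hypothetical stand-ins; only pwd.getpwnam (which raises KeyError for an unknown name) is a real standard-library call.

```python
import pwd


class UnknownUserError(Exception):
    """Hypothetical stand-in for the runner's UnknownUserError."""


def lookup_user(name, getpwnam=pwd.getpwnam):
    """Return the passwd entry for name, or raise UnknownUserError.

    getpwnam is injectable so the lookup can be exercised without
    depending on which accounts exist on the machine.
    """
    try:
        return getpwnam(name)
    except KeyError as e:
        # pwd.getpwnam raises KeyError when the user does not exist.
        raise UnknownUserError(
            'Unable to get pwent information for %r' % name) from e


def terminal_state_for(error):
    """Map an exception to a terminal state name (assumed policy).

    A missing user is treated as a task failure; any other abnormal
    termination is still reported as lost.
    """
    if isinstance(error, UnknownUserError):
        return 'TASK_FAILED'
    return 'TASK_LOST'
```

With something along these lines, the exception caught in self.control() could be inspected before the executor picks a terminal state, instead of falling through to the "Runner is dead but task state unexpectedly ACTIVE" path.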



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
