mesos-dev mailing list archives

From "Bernd Mathiske" <be...@mesosphere.io>
Subject Re: Review Request 23912: Fix MESOS-947: Slave should properly handle a killTask() that arrives between runTask() and _runTask()
Date Thu, 09 Oct 2014 14:10:06 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23912/
-----------------------------------------------------------

(Updated Oct. 9, 2014, 7:10 a.m.)


Review request for mesos.


Changes
-------

Now not only the killed task is removed, but also its executor's id and the whole framework.
Added a mock method and a check in the test to verify that removeFramework() does get called.


Bugs: MESOS-947
    https://issues.apache.org/jira/browse/MESOS-947


Repository: mesos-git


Description
-------

Fixes MESOS-947 "Slave should properly handle a killTask() that arrives between runTask()
and _runTask()".

Slave::killTask() did not check whether the task in question was "pending" (i.e. Slave::runTask()
had happened, but Slave::_runTask() had not yet) and so erroneously assumed that Slave::runTask()
had not been executed. The task was then marked "LOST" instead of "KILLED", even though
Slave::runTask() had already scheduled Slave::_runTask() to follow. Now killTask() removes the
"pending" entry and marks the task "KILLED". _runTask() learns of the kill indirectly: it checks
whether the task in question is still "pending" and, if it is not, infers that the task
has been killed and does not try to complete launching it.
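
To make the ordering concrete, here is a minimal, self-contained sketch of the idea. It is
not the code in this diff: ToySlave, TaskState, removePending() and killWhilePending() are
hypothetical names, the "pending" bookkeeping is reduced to a map of sets, and the
asynchronous scheduling of _runTask() is replaced by a direct call. Tasks that are already
running are not modeled at all.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy model only; none of these types are the real Mesos classes.
enum class TaskState { RUNNING, KILLED, LOST };

struct ToySlave {
  // frameworkId -> ids of tasks for which runTask() ran but _runTask() has not.
  std::map<std::string, std::set<std::string>> pending;
  // Status updates in the order they would be sent, as (taskId, state).
  std::vector<std::pair<std::string, TaskState>> updates;

  // First half of a launch: remember the task as pending. In the real slave
  // the second half is scheduled to run later; here the caller invokes it.
  void runTask(const std::string& frameworkId, const std::string& taskId) {
    pending[frameworkId].insert(taskId);
  }

  // A kill that finds the task pending removes the entry and reports KILLED.
  // Anything else is treated as unknown in this toy and reported LOST.
  void killTask(const std::string& frameworkId, const std::string& taskId) {
    if (removePending(frameworkId, taskId)) {
      updates.emplace_back(taskId, TaskState::KILLED);
    } else {
      updates.emplace_back(taskId, TaskState::LOST);
    }
  }

  // Second half of a launch: if the task is no longer pending, infer that it
  // was killed in the meantime and do not launch it.
  bool _runTask(const std::string& frameworkId, const std::string& taskId) {
    if (!removePending(frameworkId, taskId)) {
      return false;
    }
    updates.emplace_back(taskId, TaskState::RUNNING);
    return true;
  }

 private:
  // Erases the pending entry and drops the framework's bookkeeping once it has
  // no pending tasks left. Returns false if the task was not pending.
  bool removePending(const std::string& frameworkId, const std::string& taskId) {
    auto it = pending.find(frameworkId);
    if (it == pending.end() || it->second.erase(taskId) == 0) {
      return false;
    }
    if (it->second.empty()) {
      pending.erase(it);
    }
    return true;
  }
};

// The interleaving from the ticket: the kill arrives between the two halves.
inline ToySlave killWhilePending() {
  ToySlave slave;
  slave.runTask("framework-1", "task-1");
  slave.killTask("framework-1", "task-1");
  bool launched = slave._runTask("framework-1", "task-1");
  assert(!launched);  // the launch must not be completed after the kill
  return slave;
}
```

In this sketch the only signal passed from killTask() to _runTask() is the missing "pending"
entry, which mirrors the description above; the real slave also has to deal with executors
and framework removal, which the toy reduces to erasing an empty map entry.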


Diffs (updated)
-----

  src/slave/slave.hpp 76d505c698774204b2536b66ea8a83a9a2a5e2c1 
  src/slave/slave.cpp cb3759993f863590cb1545c73072feb0331aa6c9 
  src/tests/mesos.hpp 957e2233cc11c438fd80d3b6d1907a1223093104 
  src/tests/mesos.cpp 3dcb2acd5ad4ab5e3a7b4fe524ee077558112773 
  src/tests/slave_tests.cpp 69be28f6e82b99e23424bd2be8294f715d8040d4 

Diff: https://reviews.apache.org/r/23912/diff/


Testing
-------

Wrote a unit test that reliably creates the situation described in the ticket. Observed that
TASK_LOST and the listed log output occurred. This pointed directly to the lines in killTask()
where the problem is rooted. Ran the test again after the fix and it succeeded. Checked the log:
it looks like a "clean kill" now :-)


Thanks,

Bernd Mathiske

