aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephan Erb <s...@apache.org>
Subject Re: Review Request 60306: Ensure Thermos is not overloaded by an unlimited number of lost processes
Date Wed, 21 Jun 2017 22:48:28 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60306/
-----------------------------------------------------------

(Updated June 22, 2017, 12:48 a.m.)


Review request for Aurora.


Changes
-------

Review changes


Repository: aurora


Description
-------

Included changes:

* Thermos may consider launched processes to be LOST. Instead of
  restarting those immediately, the restarts will now be at least
  `min_duration` seconds apart. Restarts will also be capped at the
  TOTAL_RUN_LIMIT of 100 restarts. This ensures neither Thermos nor the
  observer will be overloaded by checkpoints. The handling of the LOST
  state is now consistent with the handling of both FAILED and FINISHED.
* Mark the success_transition and failure_transition as private. They
  are only used within `TaskPlanner` itself.
* Fix documented default of `min_duration` (i.e 5s rather than 15s).


Diffs (updated)
-----

  docs/reference/configuration.md 6a9a3ff988dd2102aa9d22e27f22487f18423894 
  src/main/python/apache/thermos/common/planner.py da5120f8f0c2489927a03e9d78ccb4f9b6eb275f

  src/test/python/apache/thermos/common/test_task_planner.py 132c1ec4977143b79df8d13804370e76a553c3b9



Diff: https://reviews.apache.org/r/60306/diff/2/

Changes: https://reviews.apache.org/r/60306/diff/1-2/


Testing
-------

./pants test.pytest src/test/python::


File Attachments
----------------

massive_cpu_spike.png
  https://reviews.apache.org/media/uploaded/files/2017/06/21/57cbc6e6-2cd5-4e92-995a-e0e05a57c359__massive_cpu_spike.png
massive_restart_count.png
  https://reviews.apache.org/media/uploaded/files/2017/06/21/c4cbab7c-1a48-4cf0-92ab-5fa9444813c7__massive_restart_count.png


Thanks,

Stephan Erb


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message