aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zameer Manji <zma...@apache.org>
Subject Re: Review Request 60306: Ensure Thermos is not overloaded by an unlimited number of lost processes
Date Wed, 21 Jun 2017 22:09:48 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60306/#review178571
-----------------------------------------------------------


Ship it!




I once noticed this in production but I neglected to investigate further.

Thanks for this fix.


src/main/python/apache/thermos/common/planner.py
Lines 307 (patched)
<https://reviews.apache.org/r/60306/#comment252636>

    There is duplication here with the failure case.
    
    It might be worthwhile to extract this logic out to a private method instead of duplicating
it here.


- Zameer Manji


On June 21, 2017, 2:37 p.m., Stephan Erb wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60306/
> -----------------------------------------------------------
> 
> (Updated June 21, 2017, 2:37 p.m.)
> 
> 
> Review request for Aurora.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Included changes:
> 
> * Thermos may consider launched processes to be LOST. Instead of
>   restarting those immediately, the restarts will now be at least
>   `min_duration` seconds apart. Restarts will also be capped at the
>   TOTAL_RUN_LIMIT of 100 restarts. This ensures neither Thermos nor the
>   observer will be overloaded by checkpoints. The handling of the LOST
>   state is now consistent with the handling of both FAILED and FINISHED.
> * Mark the success_transition and failure_transition as private. They
>   are only used within `TaskPlanner` itself.
> * Fix documented default of `min_duration` (i.e 5s rather than 15s).
> 
> 
> Diffs
> -----
> 
>   docs/reference/configuration.md 6a9a3ff988dd2102aa9d22e27f22487f18423894 
>   src/main/python/apache/thermos/common/planner.py da5120f8f0c2489927a03e9d78ccb4f9b6eb275f

>   src/test/python/apache/thermos/common/test_task_planner.py 132c1ec4977143b79df8d13804370e76a553c3b9

> 
> 
> Diff: https://reviews.apache.org/r/60306/diff/1/
> 
> 
> Testing
> -------
> 
> ./pants test.pytest src/test/python::
> 
> 
> File Attachments
> ----------------
> 
> massive_cpu_spike.png
>   https://reviews.apache.org/media/uploaded/files/2017/06/21/57cbc6e6-2cd5-4e92-995a-e0e05a57c359__massive_cpu_spike.png
> massive_restart_count.png
>   https://reviews.apache.org/media/uploaded/files/2017/06/21/c4cbab7c-1a48-4cf0-92ab-5fa9444813c7__massive_restart_count.png
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message