aurora-reviews mailing list archives

From Mehrdad Nurolahzade <mehr...@apache.org>
Subject Re: Review Request 56575: AURORA-1837 Improve task history pruning
Date Mon, 13 Feb 2017 17:30:59 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/
-----------------------------------------------------------

(Updated Feb. 13, 2017, 9:30 a.m.)


Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and Stephan Erb.


Changes
-------

1. Review feedback
2. Added missing test-case
3. Added sleep cycles between processing jobs to soften the impact on workload and heap


Bugs: AURORA-1837
    https://issues.apache.org/jira/browse/AURORA-1837


Repository: aurora


Description
-------

This patch addresses efficiency issues in the current implementation of `TaskHistoryPruner`.
The new design is similar to that of `JobUpdateHistoryPruner`: (a) Instead of registering
a `DelayExecutor` run upon each terminal task state transition, it runs at preconfigured intervals,
finds all terminal-state tasks that meet the pruning criteria and deletes them. (b) It makes the
initial task history pruning delay configurable so that pruning does not hamper the scheduler
at startup.
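
As a rough illustration of the interval-based approach described above, the following is a
minimal sketch and not the code in this patch. The class `PeriodicPrunerSketch`, its
`TerminalTask` type, the in-memory list standing in for task storage, and the single
age-based retention rule are all hypothetical; only `java.util.concurrent` is real API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names come from the Aurora code base.
class PeriodicPrunerSketch {
  // A terminal task, reduced to an id and the time it reached its terminal state.
  static final class TerminalTask {
    final String id;
    final long finishedAtMs;

    TerminalTask(String id, long finishedAtMs) {
      this.id = id;
      this.finishedAtMs = finishedAtMs;
    }
  }

  // Stand-in for task storage (assumption: the real pruner queries the scheduler's store).
  private final List<TerminalTask> store = new ArrayList<>();
  private final long retentionMs;

  PeriodicPrunerSketch(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  synchronized void add(TerminalTask task) {
    store.add(task);
  }

  synchronized int size() {
    return store.size();
  }

  // One pruning pass: find every terminal task older than the retention period and delete it.
  synchronized int pruneOnce(long nowMs) {
    int removed = 0;
    for (Iterator<TerminalTask> it = store.iterator(); it.hasNext();) {
      if (nowMs - it.next().finishedAtMs > retentionMs) {
        it.remove();
        removed++;
      }
    }
    return removed;
  }

  // Runs pruneOnce repeatedly. The configurable initial delay keeps the first pass
  // away from startup; nothing is scheduled per task, so no per-task state is lost on restart.
  ScheduledExecutorService start(long initialDelayMs, long intervalMs) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(
        () -> pruneOnce(System.currentTimeMillis()),
        initialDelayMs,
        intervalMs,
        TimeUnit.MILLISECONDS);
    return executor;
  }
}
```

Because each pass recomputes the set of expired tasks from storage, a restart only delays
the next pass by the initial delay rather than triggering an immediate burst of pruning.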

The new design addresses the following two efficiency problems:

1. Upon scheduler restart/failure, the in-memory state of task history pruning scheduled with
`DelayExecutor` is lost. `TaskHistoryPruner` learns about these dead tasks upon restart when
the log is replayed. These expired tasks are picked up by the second call to `executor.execute()`,
which performs job-level pruning immediately (i.e., without delay). Hence, most task history
pruning happens right after scheduler restarts and can severely hamper scheduler performance (or
cause consecutive fail-overs on test clusters when we load test the scheduler).

2. Expired tasks can be picked up for pruning multiple times. The asynchronous nature of `BatchWorker`,
which is used to process task deletions, introduces some delay between delete enqueue and delete
execution. As a result, tasks already queued for deletion in a previous evaluation round might
get picked up, evaluated and enqueued for deletion again. This is evident in the `tasks_pruned`
metric, which reports numbers much higher than the actual number of expired tasks deleted.
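
The over-counting in problem 2 can be reproduced with a toy model. This is only a sketch
under stated assumptions: `DoubleEnqueueSketch` and its fields are hypothetical, a plain
queue stands in for `BatchWorker`, and the counter is assumed to be incremented at enqueue
time, which is consistent with the symptom described above but is not confirmed here.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical model of the double-enqueue effect; not Aurora code.
class DoubleEnqueueSketch {
  // Expired tasks still present in storage.
  private final Set<String> expiredInStore = new HashSet<>();
  // Stand-in for the asynchronous deletion queue.
  private final Queue<String> deleteQueue = new ArrayDeque<>();

  long tasksPruned = 0;   // assumption: counted when a delete is enqueued
  long tasksDeleted = 0;  // counted when a delete actually removes a task

  void addExpired(String id) {
    expiredInStore.add(id);
  }

  // One evaluation round: every expired task still in storage is enqueued again,
  // even if an earlier round already queued it.
  void evaluate() {
    for (String id : expiredInStore) {
      deleteQueue.add(id);
      tasksPruned++;
    }
  }

  // Deferred execution of the queued deletes; duplicates find nothing left to remove.
  void drain() {
    String id;
    while ((id = deleteQueue.poll()) != null) {
      if (expiredInStore.remove(id)) {
        tasksDeleted++;
      }
    }
  }
}
```

With three expired tasks, two calls to `evaluate()` before a single `drain()` leave
`tasksPruned` at 6 while `tasksDeleted` is 3, mirroring the inflated metric.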


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26

  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java f77849498ff23616f1d56d133eb218f837ac3413

  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 14e4040e0b94e96f77068b41454311fa3bf53573


Diff: https://reviews.apache.org/r/56575/diff/


Testing
-------

Manual testing under Vagrant


Thanks,

Mehrdad Nurolahzade

