aurora-issues mailing list archives

From "Santhosh Kumar Shanmugham (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (AURORA-1837) Improve task history pruning
Date Sat, 11 Feb 2017 23:01:41 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862566#comment-15862566
] 

Santhosh Kumar Shanmugham edited comment on AURORA-1837 at 2/11/17 11:01 PM:
-----------------------------------------------------------------------------

Looks like the {{CallOrderEnforcingStorage}} [publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
a {{TaskStateChange}} event for every known task on startup. Note that the {{oldState}} is set
to {{Optional.absent()}} (due to lack of knowledge on startup); this causes the delay to become
zero. Due to the inefficiency in the implementation we enqueue ~O(N^2) items into the {{BatchWorker}}
queue. Although {{BatchWorker}} is designed to reduce lock contention, it does not provide
any rate limiting and suffers under bursty workloads. Responsiveness to bursty workloads makes
sense for scheduling work; however, the same cannot be said for house-keeping work.

Since history pruning ({{TaskHistoryPruner}}), job-update pruning ({{JobUpdateHistoryPruner}})
and database garbage collection ({{RowGarbageCollector}}) can be characterized as house-keeping
work that is not in the critical scheduling path, it would make sense to rate-limit these
ambient activities, so that the scheduler is protected from bursts of non-critical work (for example,
job updates with a large number of instances, network partitions, or cleanup after a scale test).


One possible design would involve creating a new {{RateLimitedBatchWorker}} that feeds into
the {{BatchWorker}}'s queue at a controlled rate. To give priority to critical (scheduling)
work from {{JobUpdateController}}, {{TaskThrottler}}, etc., {{BatchWorker}}'s queue should be
changed to a {{PriorityQueue}} (with the necessary changes to {{Work}}). {{TaskHistoryPruner}},
{{JobUpdateHistoryPruner}} and {{RowGarbageCollector}} can then enqueue work into the new {{RateLimitedBatchWorker}},
which in turn will release the work into the underlying {{BatchWorker}} at a steady rate.
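
A minimal sketch of the rate-limited feeding idea, assuming a hypothetical {{RateLimitedFeeder}} class (the name, the {{tick()}} method and the {{Consumer}} standing in for the {{BatchWorker}} queue are illustrative, not Aurora APIs). House-keeping work is buffered and at most a fixed number of items is released downstream per tick:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical sketch: buffers house-keeping work and releases it in bounded bursts. */
final class RateLimitedFeeder<T> {
  private final Queue<T> pending = new ArrayDeque<>();
  private final int maxPerTick;
  // Stand-in for "enqueue into the underlying BatchWorker" (assumption).
  private final Consumer<T> downstream;

  RateLimitedFeeder(int maxPerTick, Consumer<T> downstream) {
    if (maxPerTick <= 0) {
      throw new IllegalArgumentException("maxPerTick must be positive");
    }
    this.maxPerTick = maxPerTick;
    this.downstream = downstream;
  }

  /** Accepts work immediately; nothing is forwarded until the next tick. */
  synchronized void enqueue(T work) {
    pending.add(work);
  }

  /** Forwards at most maxPerTick items in FIFO order and returns how many were released. */
  synchronized int tick() {
    int released = 0;
    while (released < maxPerTick && !pending.isEmpty()) {
      downstream.accept(pending.poll());
      released++;
    }
    return released;
  }

  /** Number of items still waiting to be released. */
  synchronized int backlog() {
    return pending.size();
  }
}
```

In a real scheduler {{tick()}} would presumably be driven by a timer, or each release could instead be gated by a permit from Guava's {{RateLimiter}}; the manual tick here just keeps the sketch dependency-free and easy to reason about.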

We can take advantage of Java's [PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
and Guava's [RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html].
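
To illustrate the priority ordering, here is a rough sketch built on {{java.util.PriorityQueue}}. The {{PrioritizedWork}} and {{PriorityWorkQueue}} classes, the two priority levels and the sequence number are assumptions made for the example; they are not the existing {{Work}} type. The sequence number is there because {{PriorityQueue}} breaks ties arbitrarily, so FIFO order within a priority level has to be encoded explicitly:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical work item carrying a priority and an arrival sequence number. */
final class PrioritizedWork {
  // Declaration order doubles as service order: SCHEDULING is served first.
  enum Priority { SCHEDULING, HOUSE_KEEPING }

  final Priority priority;
  final long sequence;
  final String name;

  PrioritizedWork(Priority priority, long sequence, String name) {
    this.priority = priority;
    this.sequence = sequence;
    this.name = name;
  }

  // Order by priority first, then by arrival order within the same priority.
  static final Comparator<PrioritizedWork> ORDER =
      Comparator.comparing((PrioritizedWork w) -> w.priority)
          .thenComparingLong(w -> w.sequence);
}

/** Hypothetical queue wrapper; synchronized because PriorityQueue is not thread-safe. */
final class PriorityWorkQueue {
  private final PriorityQueue<PrioritizedWork> queue =
      new PriorityQueue<>(PrioritizedWork.ORDER);
  private long nextSequence = 0;

  synchronized void add(PrioritizedWork.Priority priority, String name) {
    queue.add(new PrioritizedWork(priority, nextSequence++, name));
  }

  /** Returns the name of the next work item, or null when the queue is empty. */
  synchronized String pollName() {
    PrioritizedWork work = queue.poll();
    return work == null ? null : work.name;
  }
}
```

With this ordering, scheduling work enqueued after a backlog of pruning work is still served first. {{java.util.concurrent.PriorityBlockingQueue}} would be an alternative if a blocking consumer is needed.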



> Improve task history pruning
> ----------------------------
>
>                 Key: AURORA-1837
>                 URL: https://issues.apache.org/jira/browse/AURORA-1837
>             Project: Aurora
>          Issue Type: Task
>            Reporter: Reza Motamedi
>            Assignee: Mehrdad Nurolahzade
>            Priority: Minor
>              Labels: scheduler
>
> The current implementation of {{TaskHistoryPruner}} registers all inactive tasks for pruning upon terminal
> _state_ change. {{TaskHistoryPruner::registerInactiveTask()}} uses a delay executor
> to schedule the pruning of tasks. However, we have noticed that most pruning takes
> place after the scheduler recovers from a fail-over.
> Modify {{TaskHistoryPruner}} to a design similar to {{JobUpdateHistoryPruner}}:
> # Instead of registering with the delay executor upon terminal task state transitions, have
> it wake up at preconfigured intervals, find all terminal-state tasks that meet the pruning criteria
> and delete them.
> # Make the initial task history pruning delay configurable so that it does not hamper
> the scheduler upon start.
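
The interval-based design proposed in the issue description could look roughly like the sketch below. Everything here is an assumption for illustration: the {{PeriodicHistoryPruner}} class, the in-memory map standing in for the task store, and the purely age-based pruning criterion (the real pruner may apply other criteria as well).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a pruner that wakes up periodically instead of per state change. */
final class PeriodicHistoryPruner {
  // Task id -> time the task reached a terminal state (stand-in for the task store).
  private final Map<String, Instant> terminalTasks = new ConcurrentHashMap<>();
  private final Duration retention;

  PeriodicHistoryPruner(Duration retention) {
    this.retention = retention;
  }

  void recordTerminal(String taskId, Instant when) {
    terminalTasks.put(taskId, when);
  }

  /** One pruning pass: removes tasks that turned terminal before (now - retention). */
  List<String> pruneOnce(Instant now) {
    Instant cutoff = now.minus(retention);
    List<String> removed = new ArrayList<>();
    for (Map.Entry<String, Instant> entry : terminalTasks.entrySet()) {
      if (entry.getValue().isBefore(cutoff)
          && terminalTasks.remove(entry.getKey(), entry.getValue())) {
        removed.add(entry.getKey());
      }
    }
    return removed;
  }

  int size() {
    return terminalTasks.size();
  }

  /** Schedules pruning passes; initialDelay keeps the first pass away from startup. */
  void start(ScheduledExecutorService executor, Duration initialDelay, Duration interval) {
    executor.scheduleWithFixedDelay(
        () -> pruneOnce(Instant.now()),
        initialDelay.toMillis(),
        interval.toMillis(),
        TimeUnit.MILLISECONDS);
  }
}
```

Because each pass scans for prunable tasks itself, no per-task delayed work is registered, so a fail-over does not re-enqueue one item per known task.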



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
