Date: Sat, 11 Feb 2017 23:00:43 +0000 (UTC)
From: "Santhosh Kumar Shanmugham (JIRA)"
To: issues@aurora.apache.org
Reply-To: dev@aurora.apache.org
Subject: [jira] [Comment Edited] (AURORA-1837) Improve task history pruning

    [ https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862566#comment-15862566 ]

Santhosh Kumar Shanmugham edited comment on AURORA-1837 at 2/11/17 11:00 PM:
-----------------------------------------------------------------------------

Looks like the {{CallOrderEnforcingStorage}} [publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100] a {{TaskStateChange}} event for every known task on startup. Note how the {{oldState}} is set to {{Optional.absent()}} (the previous state is not known on startup); this causes the delay to become zero. Due to the inefficiency in the implementation we enqueue ~ O(N^2) items into the {{BatchWorker}} queue.

Although {{BatchWorker}} is designed to reduce lock contention, it does not provide any rate limiting and suffers under bursty workloads. Responsiveness to bursty workloads makes sense for scheduling work; however, the same cannot be said for house-keeping work.
Seeing how history pruning ({{TaskHistoryPruner}}), job update pruning ({{JobUpdateHistoryPruner}}) and database garbage collection ({{RowGarbageCollector}}) can be characterized as house-keeping work that is not in the critical scheduling path, it would make sense to rate-limit these ambient activities, so that the scheduler is protected from bursts of non-critical work (for example, job updates with a large number of instances, network partitions, or cleaning up after a scale test).

One possible design would involve creating a new {{RateLimitedBatchWorker}} that feeds into the {{BatchWorker}}'s queue at a controlled rate. To give priority to critical (scheduling) work from {{JobUpdateController}}, {{TaskThrottler}} etc., the {{BatchWorker}}'s queue should be changed to a {{PriorityQueue}} (with the necessary changes to {{Work}}). {{TaskHistoryPruner}}, {{JobUpdateHistoryPruner}} and {{RowGarbageCollector}} can then enqueue work into the new {{RateLimitedBatchWorker}}, which in turn will release the work into the underlying {{BatchWorker}} at a steady rate.

We can take advantage of Java's [PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html] and Guava's [RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html].


was (Author: santhk):
Looks like the {{CallOrderEnforcingStorage}} [publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100] {{TaskStateChange}} event for every known task on startup. Note: how the {{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on startup); this causes the delay to become ZERO. Due to the inefficiency in the implementation we enqueue ~ O(N^2) items into {{BatchWorker}} queue.

Although {{BatchWorker}} is designed to reduce lock-contention it does not provide any rate-limiting and suffers from bursty workloads.
Responsiveness to bursty workload makes sense for scheduling work, however the same cannot be said for house-keeping work. Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning ({{JobUpdateHistoryPruner}}) and DataBase Garbage Collection ({{RowGarbageCollector}}) can be characterized as house-keeping work that is not in the critical scheduling path, it would make sense to rate-limit these ambient activities, so that the scheduler is protected from bursts of non-critical work (like - job updates with large number of instances, network-partition, cleaning up after scale-test).

One possible design would involve creating a new {{RateLimitedBatchWorker}} that feeds into the {{BatchWorker}}'s queue at a controlled rate. To provide priority to critical (scheduling) work from {{JobUpdateController}}, {{TaskThrottler}} etc, {{BatchWorker}}'s queue should be changed to a {{PriorityQueue}} (with necessary changes to {{Work}}). {{TaskHistoryPruner}}, {{JobHistoryPruner}} and {{RowGarbageCollector}} can now enqueue work into the new {{RateLimitedBatchWorker}} which will be released into the underlying {{BatchWorker}} at a steady rate.

We can take advantage of Java's [PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html] and Guava's [RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]


> Improve task history pruning
> ----------------------------
>
>                 Key: AURORA-1837
>                 URL: https://issues.apache.org/jira/browse/AURORA-1837
>             Project: Aurora
>          Issue Type: Task
>            Reporter: Reza Motamedi
>            Assignee: Mehrdad Nurolahzade
>            Priority: Minor
>              Labels: scheduler
>
> The current implementation of {{TaskHistoryPruner}} registers all inactive tasks for pruning upon terminal _state_ change. {{TaskHistoryPruner::registerInactiveTask()}} uses a delay executor to schedule the process of pruning _task_s.
> However, we have noticed that most of the pruning takes place after the scheduler recovers from a fail-over.
> Modify {{TaskHistoryPruner}} to a design similar to {{JobUpdateHistoryPruner}}:
> # Instead of registering delay executors upon terminal task state transitions, have it wake up at preconfigured intervals, find all terminal state tasks that meet the pruning criteria and delete them.
> # Make the initial task history pruning delay configurable so that it does not hamper the scheduler upon start.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
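
The two steps proposed in the quoted issue description (wake up at a preconfigured interval and delete qualifying terminal tasks, with a configurable initial delay) could look roughly like the sketch below. It is only an illustration under stated assumptions, not Aurora's implementation: {{PeriodicPrunerSketch}}, {{TaskRecord}}, {{pruneOnce}} and the retention parameter are hypothetical names, and an in-memory map stands in for the scheduler's task store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names come from the Aurora code base.
class PeriodicPrunerSketch {
  // Minimal stand-in for a stored task: is it terminal, and since when?
  static final class TaskRecord {
    final boolean terminal;
    final long terminatedAtMs;

    TaskRecord(boolean terminal, long terminatedAtMs) {
      this.terminal = terminal;
      this.terminatedAtMs = terminatedAtMs;
    }
  }

  // In-memory map standing in for the task store (assumption).
  private final Map<String, TaskRecord> store = new ConcurrentHashMap<>();
  private final long retentionMs;

  PeriodicPrunerSketch(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  void put(String taskId, TaskRecord record) {
    store.put(taskId, record);
  }

  int size() {
    return store.size();
  }

  // One pruning pass: find terminal tasks that have been terminal for at
  // least the retention window, delete them, and return their ids.
  List<String> pruneOnce(long nowMs) {
    List<String> pruned = new ArrayList<>();
    for (Map.Entry<String, TaskRecord> entry : store.entrySet()) {
      TaskRecord task = entry.getValue();
      if (task.terminal && nowMs - task.terminatedAtMs >= retentionMs) {
        pruned.add(entry.getKey());
      }
    }
    for (String id : pruned) {
      store.remove(id);
    }
    return pruned;
  }

  // Runs pruneOnce at a fixed interval. The initial delay is a parameter so
  // that the first pass does not compete with scheduler start-up work.
  ScheduledExecutorService start(long initialDelayMs, long intervalMs) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(
        () -> pruneOnce(System.currentTimeMillis()),
        initialDelayMs,
        intervalMs,
        TimeUnit.MILLISECONDS);
    return executor;
  }
}
```

Keeping the pruning pass ({{pruneOnce}}) separate from the scheduling ({{start}}) means no per-task timers are registered on state transitions, so a fail-over does not replay one delayed action per known task. The caller owns the returned executor and is expected to shut it down.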
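
The design proposed in the comment (house-keeping work released at a controlled rate into a priority-ordered worker queue) can be sketched as follows. This is a simplified illustration, not the actual {{BatchWorker}}: {{RateLimitedFeederSketch}}, {{Work}}, {{Priority}} and {{tick()}} are hypothetical names, a plain {{PriorityQueue}} stands in for the worker's queue, and a fixed per-tick budget replaces Guava's {{RateLimiter}} so that the example has no dependencies and behaves deterministically. It is also not thread-safe.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Queue;

// Hypothetical sketch; not thread-safe and not part of the Aurora code base.
class RateLimitedFeederSketch {
  // Declaration order doubles as queue order: CRITICAL sorts before HOUSEKEEPING.
  enum Priority { CRITICAL, HOUSEKEEPING }

  static final class Work {
    final String name;
    final Priority priority;
    final long seq; // submission order, used to keep FIFO order within a priority

    Work(String name, Priority priority, long seq) {
      this.name = name;
      this.priority = priority;
      this.seq = seq;
    }
  }

  // Stand-in for the worker's queue, ordered by priority and then by arrival.
  private final PriorityQueue<Work> workerQueue = new PriorityQueue<>(
      Comparator.comparing((Work w) -> w.priority).thenComparingLong(w -> w.seq));

  // House-keeping work waits here until the limiter releases it.
  private final Queue<Work> backlog = new ArrayDeque<>();
  private final int permitsPerTick;
  private long nextSeq = 0;

  RateLimitedFeederSketch(int permitsPerTick) {
    this.permitsPerTick = permitsPerTick;
  }

  // Critical (scheduling) work bypasses the limiter entirely.
  void submitCritical(String name) {
    workerQueue.add(new Work(name, Priority.CRITICAL, nextSeq++));
  }

  // House-keeping work is only buffered; tick() decides when it is released.
  void submitHousekeeping(String name) {
    backlog.add(new Work(name, Priority.HOUSEKEEPING, nextSeq++));
  }

  // Releases at most permitsPerTick buffered items into the worker queue.
  // A real implementation would call this from a timer or a rate limiter.
  int tick() {
    int released = 0;
    while (released < permitsPerTick && !backlog.isEmpty()) {
      workerQueue.add(backlog.poll());
      released++;
    }
    return released;
  }

  // Drains the worker queue and returns the names in execution order.
  List<String> drain() {
    List<String> order = new ArrayList<>();
    Work work;
    while ((work = workerQueue.poll()) != null) {
      order.add(work.name);
    }
    return order;
  }

  int backlogSize() {
    return backlog.size();
  }
}
```

With a budget of two permits per tick, three pruning items and one scheduling item submitted afterwards drain as the scheduling item first, followed by the two released pruning items; the third pruning item stays in the backlog until the next tick.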