Date: Mon, 9 May 2016 06:50:12 +0000 (UTC)
From: "Bolke de Bruin (JIRA)"
To: commits@airflow.incubator.apache.org
Subject: [jira] [Updated] (AIRFLOW-72) Implement proper capacity scheduler

     [ https://issues.apache.org/jira/browse/AIRFLOW-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bolke de Bruin updated AIRFLOW-72:
----------------------------------
    Description:
The scheduler is supposed to maintain queues and pools according to a "capacity" model. However, it is currently not properly implemented, and as a result issues exist such as pools being oversubscribed, race conditions in queuing/dequeuing, and probably others.

This Jira Epic tracks all issues related to pooling/queuing and the (tbd) roadmap to a proper capacity scheduler.

Why queuing / scheduling is broken:

Locking is not properly implemented, and cannot be, because the check for slot availability is spread across the scheduler, taskinstance and executor. This makes obtaining a slot non-atomic and results in oversubscription. In addition, it leads to race conditions, such as two tasks being picked from the queue at the same time, because the scheduler determines that a queued task still needs to be sent to the executor while an earlier run already did so.

In order to fix this, Pool handling needs to be centralized (code-wise) and guarded by a mutex (with_for_update()) on the database records. The scheduler can then do something like:

    slot = Pool.obtain_slot(pool_id)
    Pool.release_slot(slot)

  was:
The scheduler is supposed to maintain queues and pools according to a "capacity" model. However, it is currently not properly implemented, and as a result issues exist such as pools being oversubscribed, race conditions in queuing/dequeuing, and probably others.

This Jira Epic tracks all issues related to pooling/queuing and the (tbd) roadmap to a proper capacity scheduler.

> Implement proper capacity scheduler
> -----------------------------------
>
>                 Key: AIRFLOW-72
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-72
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: pools, scheduler
>    Affects Versions: Airflow 1.7.1
>            Reporter: Bolke de Bruin
>              Labels: pool, queue, scheduler
>             Fix For: Airflow 2.0
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
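
The centralized slot handling proposed in the description could be sketched roughly as follows. This is only an illustration, not Airflow's implementation: the `Pool` class, its `pool` table, the `PoolFull` exception and the `create`/`used` helpers are all hypothetical, and the "slot" handle is simply the pool id. The issue suggests a row lock via with_for_update(); because the sketch uses the standard-library sqlite3 module, which has no SELECT ... FOR UPDATE, it swaps in a single conditional UPDATE so that the availability check and the claim happen in one statement.

```python
import sqlite3


class PoolFull(Exception):
    """Raised when no slot could be claimed (pool full or unknown)."""


class Pool:
    """Hypothetical centralized pool accounting; not Airflow's actual model."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pool ("
            "pool_id TEXT PRIMARY KEY, "
            "slots INTEGER NOT NULL, "
            "used INTEGER NOT NULL DEFAULT 0)"
        )
        conn.commit()

    def create(self, pool_id, slots):
        self.conn.execute(
            "INSERT INTO pool (pool_id, slots) VALUES (?, ?)", (pool_id, slots)
        )
        self.conn.commit()

    def obtain_slot(self, pool_id):
        # Check and claim in ONE statement: the row is only updated while
        # used < slots, so two callers cannot both take the last slot.
        cur = self.conn.execute(
            "UPDATE pool SET used = used + 1 "
            "WHERE pool_id = ? AND used < slots",
            (pool_id,),
        )
        self.conn.commit()
        if cur.rowcount != 1:
            raise PoolFull(pool_id)
        return pool_id  # assumption: the slot handle is just the pool id

    def release_slot(self, slot):
        # Never let the counter drop below zero on a double release.
        self.conn.execute(
            "UPDATE pool SET used = used - 1 WHERE pool_id = ? AND used > 0",
            (slot,),
        )
        self.conn.commit()

    def used(self, pool_id):
        row = self.conn.execute(
            "SELECT used FROM pool WHERE pool_id = ?", (pool_id,)
        ).fetchone()
        return row[0]


if __name__ == "__main__":
    pool = Pool(sqlite3.connect(":memory:"))
    pool.create("etl", slots=2)
    first = pool.obtain_slot("etl")
    second = pool.obtain_slot("etl")
    try:
        pool.obtain_slot("etl")
    except PoolFull:
        print("pool 'etl' is full, task stays queued")
    pool.release_slot(first)
```

With a database that supports row locks, the same two methods could instead select the pool row with with_for_update() inside a transaction, as the description suggests. Either way, the point is that the scheduler, taskinstance and executor would all go through this one code path instead of each checking availability themselves.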