airflow-commits mailing list archives

From "Bolke de Bruin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AIRFLOW-72) Implement proper capacity scheduler
Date Mon, 09 May 2016 06:50:12 GMT

     [ https://issues.apache.org/jira/browse/AIRFLOW-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bolke de Bruin updated AIRFLOW-72:
----------------------------------
    Description: 
The scheduler is supposed to maintain queues and pools according to a "capacity" model. However,
this is currently not properly implemented, so issues exist such as being able to oversubscribe
pools, race conditions in queuing/dequeuing, and probably others.

This Jira Epic tracks all issues related to pooling/queuing and the (tbd) roadmap to
a proper capacity scheduler.

Why queuing / scheduling is broken:

Locking is not properly implemented, and cannot be, because the check for slot availability is
spread throughout the scheduler, taskinstance and executor. This makes obtaining a slot non-atomic
and results in oversubscription. It also leads to race conditions, such as two tasks
being picked from the queue at the same time because the scheduler determines that a queued task
still needs to be sent to the executor when an earlier run has already sent it.

To fix this, pool handling needs to be centralized (code-wise) and work with a mutex
(with_for_update()) on the database records. The scheduler can then do something like:

slot = Pool.obtain_slot(pool_id)
Pool.release_slot(slot)
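
To make the idea concrete, here is a minimal sketch of what centralized slot handling could
look like. It is an illustration only, not Airflow's actual Pool model: the table layout, the
`create` helper, the use of the pool id as the slot token and the `demo` function are all
assumptions. To stay self-contained it uses the standard-library sqlite3 module, and since
SQLite has no SELECT ... FOR UPDATE, a `BEGIN IMMEDIATE` transaction stands in for the
with_for_update() row lock: the write lock is taken before the availability check, so the
check and the update happen as one atomic step.

```python
import os
import sqlite3
import tempfile
import threading


class Pool:
    """Hypothetical single place for all slot accounting (not Airflow's real model)."""

    def __init__(self, db_path):
        self.db_path = db_path
        conn = self._connect()
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pool "
            "(id TEXT PRIMARY KEY, slots INTEGER NOT NULL, used INTEGER NOT NULL)"
        )
        conn.close()

    def _connect(self):
        # isolation_level=None: transactions are started and ended explicitly below.
        return sqlite3.connect(self.db_path, timeout=30, isolation_level=None)

    def create(self, pool_id, slots):
        conn = self._connect()
        conn.execute("INSERT INTO pool VALUES (?, ?, 0)", (pool_id, slots))
        conn.close()

    def obtain_slot(self, pool_id):
        """Return a slot token (here simply the pool id), or None if the pool is full."""
        conn = self._connect()
        try:
            # Take the write lock before reading, so no other caller can
            # pass the availability check until this transaction ends.
            conn.execute("BEGIN IMMEDIATE")
            row = conn.execute(
                "SELECT slots, used FROM pool WHERE id = ?", (pool_id,)
            ).fetchone()
            if row is None or row[1] >= row[0]:
                conn.execute("ROLLBACK")
                return None
            conn.execute("UPDATE pool SET used = used + 1 WHERE id = ?", (pool_id,))
            conn.execute("COMMIT")
            return pool_id
        finally:
            conn.close()

    def release_slot(self, slot):
        conn = self._connect()
        try:
            conn.execute("BEGIN IMMEDIATE")
            conn.execute(
                "UPDATE pool SET used = MAX(used - 1, 0) WHERE id = ?", (slot,)
            )
            conn.execute("COMMIT")
        finally:
            conn.close()


def demo(workers=8, slots=3):
    """Let several threads compete for slots; return how many were granted."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        pool = Pool(path)
        pool.create("default", slots)
        granted = []

        def worker():
            slot = pool.obtain_slot("default")
            if slot is not None:
                granted.append(slot)

        threads = [threading.Thread(target=worker) for _ in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return len(granted)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Because every caller goes through obtain_slot(), the number of granted slots can never exceed
the pool size, no matter how many workers compete. With a server database the same shape
would presumably use a row lock via with_for_update() instead of a database-wide write lock.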




  was:
The scheduler is supposed to maintain queues and pools according to a "capacity" model. However,
this is currently not properly implemented, so issues exist such as being able to oversubscribe
pools, race conditions in queuing/dequeuing, and probably others.

This Jira Epic tracks all issues related to pooling/queuing and the (tbd) roadmap to
a proper capacity scheduler.




> Implement proper capacity scheduler
> -----------------------------------
>
>                 Key: AIRFLOW-72
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-72
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: pools, scheduler
>    Affects Versions: Airflow 1.7.1
>            Reporter: Bolke de Bruin
>              Labels: pool, queue, scheduler
>             Fix For: Airflow 2.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
