airflow-commits mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-192) Implement priority_weight aggregation using ancestors (rather than successors)
Date Thu, 18 Jan 2018 15:11:00 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330621#comment-16330621 ]

ASF subversion and git services commented on AIRFLOW-192:
---------------------------------------------------------

Commit dd2bc8cb971d25087a35db16d12592f759ecbc6a in incubator-airflow's branch refs/heads/master
from [~wongwill]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=dd2bc8c ]

[AIRFLOW-192] Add weight_rule param to BaseOperator

Improved task generation performance significantly by using sets of
task_ids and dag_ids instead of lists when calculating total priority
weight.

Closes #2941 from wongwill86/performance-latest
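A minimal sketch of why tracking ids in a set helps when walking a dag, assuming
a plain dict of task_id -> downstream task_ids. The function name
collect_downstream_ids and the sample edges are hypothetical and are not taken
from the commit; the commit's actual code may differ.

```python
from typing import Dict, List, Set


def collect_downstream_ids(task_id: str, downstream: Dict[str, List[str]]) -> Set[str]:
    """Return the ids of every task reachable from task_id (excluding itself)."""
    seen: Set[str] = set()
    stack = list(downstream.get(task_id, []))
    while stack:
        current = stack.pop()
        # Membership tests on a set are O(1) on average; on a list each
        # test would scan the whole list, which adds up on large dags.
        if current in seen:
            continue
        seen.add(current)
        stack.extend(downstream.get(current, []))
    return seen


# Hypothetical diamond-shaped dag: a -> b, a -> c, b -> d, c -> d
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(sorted(collect_downstream_ids("a", edges)))
```

Because "d" is reachable through both "b" and "c", the set also keeps it from
being counted twice.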


> Implement priority_weight aggregation using ancestors (rather than successors)
> ------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-192
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-192
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: operators
>    Affects Versions: Airflow 1.7.1.2
>            Reporter: Sergei Iakhnin
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Currently tasks are being scheduled based on the priority_weight. The effective priority
> of a task is its own priority plus the priorities of all tasks that follow it in a dag. This
> results in undesirable scheduling behaviour in my use case.
> My use case involves running scientific workflows where a number of operations are being
> carried out on a set of samples. Each sample is handled by a separate dag run that
> is manually triggered. It is common for several thousand dag instances to be in flight at
> a given time. The dag reserves a sample, operates on it, and then releases it. I would like
> for each sample to be reserved for as short a time as possible, so that other programs can
> have an opportunity to operate on it and dag runs can complete as fast as possible. However,
> because of the current priority logic, if I were to schedule several thousand dags at a given
> time, they would first all execute their first state, then all execute their second state,
> etc. Thus, no dag can complete fully until all dags complete their second-to-last state. This
> results in unnecessarily long dag run times and simultaneous completion of all dags.
> Ideally, Airflow would support the reverse of the current logic used for priorities, i.e.
> a task's priority is the sum of priorities of all its ancestors. This way, the further along
> a dag is in its processing, the more likely its tasks will get scheduled (thus leading to a
> shorter completion time, and release of its resources).
> Also, a nominal priority mode would be useful, where a task's priority is exactly the
> number given to it by the author, in order to allow more scheduling flexibility.
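
The three aggregation modes described in the issue can be sketched in a few
lines of plain Python. This is an illustration only, not Airflow's
implementation: the function names, the mode labels ("downstream", "upstream",
"absolute") and the three-task dag are assumptions made for this example.

```python
from typing import Dict, List, Set


def reachable(start: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Ids of all nodes reachable from start by following edges."""
    seen: Set[str] = set()
    stack = list(edges.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen


def invert(edges: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Turn a task -> children mapping into a task -> parents mapping."""
    reverse: Dict[str, List[str]] = {node: [] for node in edges}
    for parent, children in edges.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return reverse


def effective_priority(task_id: str, weights: Dict[str, int],
                       downstream: Dict[str, List[str]], mode: str) -> int:
    """Aggregate priority under one of three illustrative modes."""
    if mode == "absolute":      # exactly the number given by the author
        return weights[task_id]
    if mode == "downstream":    # own weight plus all tasks that follow
        related = reachable(task_id, downstream)
    elif mode == "upstream":    # own weight plus all ancestors
        related = reachable(task_id, invert(downstream))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return weights[task_id] + sum(weights[t] for t in related)


# Hypothetical dag from the use case: reserve -> process -> release
downstream = {"reserve": ["process"], "process": ["release"], "release": []}
weights = {"reserve": 1, "process": 1, "release": 1}

for mode in ("downstream", "upstream", "absolute"):
    print(mode, {t: effective_priority(t, weights, downstream, mode) for t in weights})
```

With ancestor-based aggregation the last task of a dag run gets the highest
effective priority, so runs that are further along are favoured over runs that
have not started, which is the behaviour the reporter asks for.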



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
