airflow-commits mailing list archives

From "Kevin McHale (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AIRFLOW-3285) lazy marking of upstream_failed task state
Date Thu, 01 Nov 2018 18:14:00 GMT
Kevin McHale created AIRFLOW-3285:
-------------------------------------

             Summary: lazy marking of upstream_failed task state
                 Key: AIRFLOW-3285
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3285
             Project: Apache Airflow
          Issue Type: Improvement
            Reporter: Kevin McHale


Airflow aggressively applies the {{upstream_failed}} task state: as soon as a task fails,
all of its downstream dependencies get marked.  This sometimes creates problems for us at
Etsy.

In particular, we use a pattern along these lines for our Hadoop Airflow DAGs:
 # the DAG creates a hadoop cluster in GCP/Dataproc
 # the DAG executes its tasks on the cluster
 # the DAG deletes the cluster once all tasks are done

There are some cases in which the tasks immediately upstream of the cluster-delete step get
marked as {{upstream_failed}}, triggering the cluster-delete step, even while other tasks
continue to execute without problems on the cluster.  The cluster-delete step of course kills
all of the running tasks, requiring all of them to be re-run once the problem with the failed
task is mitigated.

As an example, a DAG that looks like this can exhibit the problem:
{code:python}
Cluster = ClusterCreateOperator(...)

A = Job1Operator(...)
Cluster >> A

B = Job2Operator(...)
Cluster >> B

C = Job3Operator(...)
A >> C
B >> C

ClusterDelete = DeleteClusterOperator(trigger_rule="all_done", ...)
C >> ClusterDelete
{code}
In a DAG like this, suppose task A fails while task B is running.  Task C will immediately
be marked {{upstream_failed}}, causing ClusterDelete to run while task B is still running,
which in turn causes task B to fail as well.

Our solution to this problem has been to implement something like [this diff|https://github.com/mchalek/incubator-airflow/commit/585349018656cd9b2e3e3e113db6412345485dde], which
lazily applies the {{upstream_failed}} state only to tasks for which all upstream tasks have
already completed.
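The lazy rule can be sketched in plain Python.  This is only my illustration of the behavior described above, not the code from the linked diff; the function name and state strings are assumptions for the sketch:

```python
# Sketch of the lazy upstream_failed rule (illustrative only; not the
# implementation from the linked diff).

TERMINAL_STATES = {"success", "failed", "upstream_failed", "skipped"}
FAILED_STATES = {"failed", "upstream_failed"}

def should_mark_upstream_failed(upstream_states):
    """Return True once a task should be marked upstream_failed.

    Eager (current) behavior marks as soon as any upstream task fails;
    lazy behavior additionally waits for every upstream task to finish.
    """
    all_done = all(s in TERMINAL_STATES for s in upstream_states)
    any_failed = any(s in FAILED_STATES for s in upstream_states)
    return all_done and any_failed

# Task C from the example: A failed, B still running -> C is not marked
# yet, so ClusterDelete (trigger_rule="all_done") does not fire while B
# still holds the cluster.
print(should_mark_upstream_failed(["failed", "running"]))   # False
print(should_mark_upstream_failed(["failed", "success"]))   # True
```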

In the example above, the consequence is that task C will not be marked {{upstream_failed}} in
response to task A failing until task B completes, ensuring that the cluster is not deleted
while any upstream tasks are still running.

We have observed no adverse behavior on our Airflow instances, so we run all of them
with this lazy-marking feature enabled.  However, we recognize that existing users may
want to opt in to a behavior change like this, so the diff includes a config flag that
defaults to the original behavior.
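For illustration, the opt-in flag could look like the following in {{airflow.cfg}}; the section and flag name here are hypothetical, not necessarily the names used in the linked diff:

```
[scheduler]
# Hypothetical flag name -- see the linked diff for the actual option.
# False preserves the current eager marking of upstream_failed.
lazy_upstream_failed_marking = False
```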

We would appreciate your consideration of incorporating this diff, or something like it, to
allow us to configure this behavior in unmodified, upstream Airflow.

Thanks!


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
