airflow-commits mailing list archives

From "Jong Kim (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AIRFLOW-593) Tasks do not get backfilled sequentially
Date Tue, 25 Oct 2016 00:06:58 GMT
Jong Kim created AIRFLOW-593:
--------------------------------

             Summary: Tasks do not get backfilled sequentially
                 Key: AIRFLOW-593
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-593
             Project: Apache Airflow
          Issue Type: Bug
          Components: DagRun, scheduler
    Affects Versions: Airflow 1.7.1.3
            Reporter: Jong Kim
            Priority: Minor


I need the tasks within a DAG to complete in order when running backfills. I am running
locally on my Mac using SequentialExecutor.

Let's say I have a DAG running daily at 11AM UTC (0 11 * * *) with start_date set to
datetime(2016, 10, 20, 11, 0, 0). The DAG consists of 3 tasks, which must complete in order:
task0 -> task1 -> task2. This dependency is set using .set_downstream().

Today (2016/10/22) I reset the database, turn on the DAG using the on/off toggle in the
webserver, and run "airflow scheduler", which automatically backfills starting from
start_date.

It will backfill for 2016/10/20 and 2016/10/21. I expect the backfill to run sequentially,
like this:
datetime(2016, 10, 20, 11, 0, 0) task0
datetime(2016, 10, 20, 11, 0, 0) task1
datetime(2016, 10, 20, 11, 0, 0) task2
datetime(2016, 10, 21, 11, 0, 0) task0
datetime(2016, 10, 21, 11, 0, 0) task1
datetime(2016, 10, 21, 11, 0, 0) task2

With 'depends_on_past': False, I see Airflow running the tasks grouped by task rather than
by execution date, something like this, which is not what I want:
datetime(2016, 10, 20, 11, 0, 0) task0
datetime(2016, 10, 21, 11, 0, 0) task0
datetime(2016, 10, 20, 11, 0, 0) task1
datetime(2016, 10, 21, 11, 0, 0) task1
datetime(2016, 10, 20, 11, 0, 0) task2
datetime(2016, 10, 21, 11, 0, 0) task2
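
The difference between the two listings above comes down to the sort key. The following is
a minimal sketch in plain Python, not Airflow code; the names runs, expected and observed
are only illustrative.

```python
from datetime import datetime
from itertools import product

# The two execution dates and three tasks from the report.
dates = [datetime(2016, 10, 20, 11), datetime(2016, 10, 21, 11)]
tasks = ["task0", "task1", "task2"]
runs = list(product(dates, tasks))

# Expected: finish every task of one execution date before starting the next.
expected = sorted(runs, key=lambda run: (run[0], run[1]))

# Seen with 'depends_on_past': False: grouped by task first, then by date.
observed = sorted(runs, key=lambda run: (run[1], run[0]))

for when, task in observed:
    print(when.isoformat(), task)
```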

With 'depends_on_past': True and 'wait_for_downstream': True, I expect it to run in the order
I need, but instead it runs some tasks out of order, like this:
datetime(2016, 10, 20, 11, 0, 0) task0
datetime(2016, 10, 20, 11, 0, 0) task1
datetime(2016, 10, 21, 11, 0, 0) task0   <- out of order!
datetime(2016, 10, 20, 11, 0, 0) task2   <- out of order!
datetime(2016, 10, 21, 11, 0, 0) task1
datetime(2016, 10, 21, 11, 0, 0) task2
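
One possible reading of the order above, sketched in plain Python. It assumes the documented
meaning of 'wait_for_downstream' (a task instance waits only for the tasks immediately
downstream of its previous instance) and is not a statement about how the scheduler is
actually implemented. The helper names can_run and order_is_allowed are hypothetical.

```python
# Task graph from the report: task0 -> task1 -> task2.
UPSTREAM = {"task0": [], "task1": ["task0"], "task2": ["task1"]}
DOWNSTREAM = {"task0": ["task1"], "task1": ["task2"], "task2": []}
DATES = ["2016-10-20", "2016-10-21"]


def can_run(date, task, done):
    """Return True if (date, task) may start, given the set of finished runs."""
    # Upstream tasks of the same execution date must have finished.
    if any((date, up) not in done for up in UPSTREAM[task]):
        return False
    index = DATES.index(date)
    if index == 0:
        return True  # the first execution date has no previous run
    previous = DATES[index - 1]
    # depends_on_past: the same task on the previous date must have finished.
    if (previous, task) not in done:
        return False
    # wait_for_downstream (assumed meaning): only the IMMEDIATE downstream
    # tasks of the previous instance must have finished.
    return all((previous, down) in done for down in DOWNSTREAM[task])


def order_is_allowed(order):
    """Check a whole run order against the constraints above."""
    done = set()
    for date, task in order:
        if not can_run(date, task, done):
            return False
        done.add((date, task))
    return True


observed = [
    ("2016-10-20", "task0"),
    ("2016-10-20", "task1"),
    ("2016-10-21", "task0"),
    ("2016-10-20", "task2"),
    ("2016-10-21", "task1"),
    ("2016-10-21", "task2"),
]
print(order_is_allowed(observed))
```

Under that assumption, task0 for 2016/10/21 only has to wait for task0 and task1 of
2016/10/20, not for task2, so these constraints alone would not rule out the order I saw.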

Is this a bug? If not, am I understanding 'depends_on_past' and 'wait_for_downstream' correctly?
What do I need to do?

The only remedy I can think of is to backfill each date manually.
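
A minimal sketch of that manual workaround: build one "airflow backfill" command per
execution date so the dates can be run one after another. The DAG id "my_dag" and the helper
name backfill_commands are placeholders, and the sketch assumes a daily schedule.

```python
from datetime import datetime, timedelta


def backfill_commands(dag_id, start, end):
    """Build one `airflow backfill` command per daily execution date."""
    commands = []
    current = start
    while current <= end:
        stamp = current.isoformat()
        # Same start and end date, so each command covers a single run.
        commands.append(["airflow", "backfill", dag_id, "-s", stamp, "-e", stamp])
        current += timedelta(days=1)
    return commands


# "my_dag" is a placeholder DAG id, not the one from the gist.
commands = backfill_commands(
    "my_dag", datetime(2016, 10, 20, 11), datetime(2016, 10, 21, 11)
)
for command in commands:
    print(" ".join(command))
```

Each command could then be passed to subprocess.check_call in turn, so that the next date
starts only after the previous one has exited successfully.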

Public gist of DAG: https://gist.github.com/jong-eatsa/cba1bf3c182b38e966696da47164faf1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
