airflow-commits mailing list archives

From "Jong Kim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-593) Tasks do not get backfilled sequentially
Date Tue, 29 Aug 2017 23:43:00 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146348#comment-16146348 ]

Jong Kim commented on AIRFLOW-593:
----------------------------------

Any update on this? I consider this a pretty serious bug. The public gist of the DAG in the
issue description can easily be run to verify my claim.

The current workaround I have is to backfill each DAG run manually, one at a time, which
defeats the purpose of a "backfill" command.
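
A minimal sketch of the one-at-a-time workaround described above, assuming the Airflow 1.x CLI form `airflow backfill DAG_ID -s START -e END` and that it accepts ISO-style timestamps. The DAG id `example_dag` and the helper name `backfill_commands` are hypothetical, chosen only for illustration. The script only builds and prints the commands; it does not run them.

```python
from datetime import datetime, timedelta


def backfill_commands(dag_id, start, end, step=timedelta(days=1)):
    """Build one `airflow backfill` command per execution date, oldest first.

    Each command covers a single execution date (start == end), so running
    the commands in list order forces the DAG runs to complete one by one.
    """
    commands = []
    current = start
    while current <= end:
        stamp = current.strftime("%Y-%m-%dT%H:%M:%S")
        # Assumed CLI shape: airflow backfill <dag_id> -s <start> -e <end>
        commands.append(["airflow", "backfill", dag_id, "-s", stamp, "-e", stamp])
        current += step
    return commands


# "example_dag" is a placeholder DAG id, not the one from the gist.
for cmd in backfill_commands(
    "example_dag", datetime(2016, 10, 20, 11), datetime(2016, 10, 21, 11)
):
    print(" ".join(cmd))
```

Each printed command could then be run in order, for example with `subprocess.check_call(cmd)`, so that the next date starts only after the previous one exits successfully.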

> Tasks do not get backfilled sequentially
> ----------------------------------------
>
>                 Key: AIRFLOW-593
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-593
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: DagRun, scheduler
>    Affects Versions: Airflow 1.7.1.3
>            Reporter: Jong Kim
>            Priority: Minor
>
> I need the tasks within a DAG to complete in order when running backfills. I am running locally on my Mac using SequentialExecutor.
> Let's say I have a DAG running daily at 11AM UTC (0 11 * * *) with a start_date of datetime(2016, 10, 20, 11, 0, 0). The DAG consists of 3 tasks, which must complete in order: task0 -> task1 -> task2. This dependency is set using .set_downstream().
> Today (2016/10/22) I reset the database, turn on the DAG using the on/off toggle in the webserver, and issue "airflow scheduler", which automatically backfills starting from start_date.
> It will backfill for 2016/10/20 and 2016/10/21. I expect the backfill to run sequentially, like the following:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 20, 11, 0, 0) task2
> datetime(2016, 10, 21, 11, 0, 0) task0
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task2
> With 'depends_on_past': False, I see Airflow running tasks grouped by sequence number, something like this, which is not what I want:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 21, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 20, 11, 0, 0) task2
> datetime(2016, 10, 21, 11, 0, 0) task2
> With 'depends_on_past': True and 'wait_for_downstream': True, I expect it to run the way I need, but instead it runs some tasks out of order, like this:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task0   <- out of order!
> datetime(2016, 10, 20, 11, 0, 0) task2   <- out of order!
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task2
> Is this a bug? If not, am I understanding 'depends_on_past' and 'wait_for_downstream' correctly? What do I need to do?
> The only remedy I can think of is to backfill each date manually.
> Public gist of DAG: https://gist.github.com/jong-eatsa/cba1bf3c182b38e966696da47164faf1
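
The three orderings in the report can be compared with a small, Airflow-free sketch. This is only an illustration of the ordering property being asked for, not Airflow's scheduling logic; the function names are hypothetical, and the task names and dates are copied from the description above.

```python
from datetime import datetime

TASKS = ["task0", "task1", "task2"]
DATES = [datetime(2016, 10, 20, 11), datetime(2016, 10, 21, 11)]


def expected_order(dates, tasks):
    """Finish every task of one DAG run before starting the next run."""
    return [(d, t) for d in dates for t in tasks]


def grouped_by_task_order(dates, tasks):
    """Order reported with depends_on_past=False: each task across all dates."""
    return [(d, t) for t in tasks for d in dates]


def is_sequential(order):
    """True if the execution dates never go backwards in the run order."""
    seen = [d for d, _ in order]
    return seen == sorted(seen)


# Order reported with depends_on_past=True and wait_for_downstream=True.
OBSERVED_WITH_DEPENDS_ON_PAST = [
    (DATES[0], "task0"),
    (DATES[0], "task1"),
    (DATES[1], "task0"),  # starts before the 10/20 run has finished
    (DATES[0], "task2"),
    (DATES[1], "task1"),
    (DATES[1], "task2"),
]

print(is_sequential(expected_order(DATES, TASKS)))
print(is_sequential(grouped_by_task_order(DATES, TASKS)))
print(is_sequential(OBSERVED_WITH_DEPENDS_ON_PAST))
```

Only the first ordering keeps the execution dates from going backwards. In both reported orderings a 2016/10/21 task starts before the 2016/10/20 run is done.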



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
