airflow-commits mailing list archives

From "Bolke de Bruin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (AIRFLOW-20) Align start_date with the schedule_interval
Date Sat, 30 Apr 2016 19:41:12 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265448#comment-15265448
] 

Bolke de Bruin edited comment on AIRFLOW-20 at 4/30/16 7:40 PM:
----------------------------------------------------------------

I took the plunge and made the "dag_run_id" field for TaskInstances NOT NULL. This allowed
me to catch the places where TaskInstances are manipulated and to fix them. There are indeed
some strange places that create TaskInstances, like the PythonOperator and views.py. Cleaning
that up is really a refactoring task for the future (e.g. applying a DAO pattern might be smart).

The good news is these have all been fixed. All tests pass, including some additional ones
I added. 

Now I can dive into some other cases. For example, what do we do when a backfill has an
execution_date that is equal to that of a previous scheduled run or backfill?

1. Do we update the tasks, run the new ones and skip the ones that were successful? If so, do
we connect the old ones to the backfill run or do we leave them as is (i.e. connected to the
old run)? Think lineage here. How do we handle dependencies if the old tasks stay connected to
their previous run?
2. Do we refuse to run? That would prevent you from "adding tasks" in the past.

I lean towards option 1: connect all tasks to the backfill and set the state of the "old"
dagrun to "SUPERSEDED" or "OVERRIDDEN". The same flexibility as today would exist, but an
audit trail is preserved.
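
Option 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: the DagRun and TaskInstance classes are simplified, hypothetical stand-ins (not Airflow's actual models), the function name supersede_with_backfill is made up, and "SUPERSEDED" is just one of the two state names proposed above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


# Hypothetical, simplified stand-ins for illustration; not Airflow's real models.
@dataclass
class DagRun:
    id: int
    execution_date: datetime
    state: str = "running"


@dataclass
class TaskInstance:
    task_id: str
    dag_run_id: int  # always set, mirroring the NOT NULL constraint
    state: str = "none"


def supersede_with_backfill(
    old_run: DagRun,
    backfill_run: DagRun,
    task_instances: List[TaskInstance],
    task_ids: List[str],
) -> List[str]:
    """Re-attach the old run's task instances to the backfill run, create
    instances for tasks that did not exist before, and return the task ids
    that still need to run (everything not already successful)."""
    by_task = {ti.task_id: ti for ti in task_instances if ti.dag_run_id == old_run.id}
    to_run = []
    for task_id in task_ids:
        ti = by_task.get(task_id)
        if ti is None:
            # A task that was "added in the past": it only exists in the backfill.
            ti = TaskInstance(task_id=task_id, dag_run_id=backfill_run.id)
            task_instances.append(ti)
        else:
            # Connect the old task instance to the backfill run.
            ti.dag_run_id = backfill_run.id
        if ti.state != "success":
            to_run.append(task_id)
    # The old run is kept, not deleted, so the audit trail survives.
    old_run.state = "SUPERSEDED"
    return to_run


when = datetime(2016, 4, 30)
old, backfill = DagRun(1, when), DagRun(2, when)
tis = [TaskInstance("extract", 1, "success"), TaskInstance("load", 1, "failed")]
pending = supersede_with_backfill(old, backfill, tis, ["extract", "load", "report"])
```

In this sketch the successful "extract" task is skipped, while "load" and the newly added "report" are scheduled, and every task instance ends up pointing at the backfill run.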





> Align start_date with the schedule_interval
> -------------------------------------------
>
>                 Key: AIRFLOW-20
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-20
>             Project: Apache Airflow
>          Issue Type: Improvement
>            Reporter: Bolke de Bruin
>              Labels: backfill, database, scheduler
>
> The need to align the start_date with the interval is counterintuitive
> and leads to a lot of questions and issue creation, although it is in the documentation.
> If we are able to fix this with little or no consequence for current setups, that should
> be preferred, I think.
> The dependency explainer is really great work, but it doesn’t address the core issue.
> If you consider a DAG a description of cohesion between work items (in OOP Java terms
> a class), then a DagRun is the instantiation of a DAG in time (in OOP Java terms an instance).
> Tasks are then the description of a work item and a TaskInstance the instantiation of
> the Task in time.
> In my opinion, issues pop up due to the current paradigm of considering the TaskInstance
> the smallest unit of work and asking it to maintain its own state in relation to other
> TaskInstances in a DagRun and in a previous DagRun of which it has no (real) perception.
> Tasks are instantiated by a cartesian product with the dates of the DagRuns instead of
> the DagRuns themselves.
> The very loose coupling between DagRuns and TaskInstances can be improved while maintaining
> the flexibility to run tasks without a DagRun. This would help with a couple of things:
> 1. start_date can be used as an ‘execution_date’ or a point in time when to start looking
> 2. a new interval for a dag will maintain depends_on_past
> 3. paused dags do not give trouble
> 4. tasks will be executed in order
> 5. the ignore_first_depend_on_past could be removed as a task will now know if it is really the first
> In PR-1431 a lot of this work has been done by:
> 1. Adding a “previous” field to a DagRun allowing it to connect to its predecessor
> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the DagRun if needed
> 3. Using start_date + interval as the first run date, unless start_date is on the interval, in which case start_date is the first run date
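
The three items listed for PR-1431 might look roughly like this in miniature. It is a sketch, not the code from the PR: the classes are hypothetical stand-ins, the helper names first_run_date and is_first are invented for illustration, the interval is assumed to be a fixed timedelta (cron expressions are not handled), and "on the interval" is assumed to mean a whole number of intervals since an arbitrary anchor of 1970-01-01 00:00.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: intervals are aligned against this fixed anchor.
ANCHOR = datetime(1970, 1, 1)


def first_run_date(start_date: datetime, interval: timedelta) -> datetime:
    """Item 3: start_date itself if it falls on the interval,
    otherwise start_date + interval."""
    on_interval = (start_date - ANCHOR) % interval == timedelta(0)
    return start_date if on_interval else start_date + interval


# Hypothetical, simplified stand-ins; not Airflow's real models.
@dataclass
class DagRun:
    id: int
    execution_date: datetime
    previous: Optional["DagRun"] = None  # item 1: link to the predecessor

    def is_first(self) -> bool:
        # With an explicit link, "am I really the first run?" is a simple lookup.
        return self.previous is None


@dataclass
class TaskInstance:
    task_id: str
    dag_run_id: int  # item 2: the TaskInstance knows its DagRun


daily = timedelta(days=1)
aligned = first_run_date(datetime(2016, 4, 30), daily)
unaligned = first_run_date(datetime(2016, 4, 30, 19, 41), daily)

run1 = DagRun(1, aligned)
run2 = DagRun(2, aligned + daily, previous=run1)
ti = TaskInstance("load", dag_run_id=run2.id)
```

With a daily interval, a start_date at midnight is already on the interval and becomes the first run date, while a start_date of 19:41 is pushed forward by one interval. The previous link is what would let a task decide whether depends_on_past applies without a separate ignore_first_depend_on_past flag.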



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
