airflow-commits mailing list archives

From "Bolke de Bruin (JIRA)"
Subject [jira] [Created] (AIRFLOW-20) Align start_date with the schedule_interval
Date Fri, 29 Apr 2016 13:59:13 GMT
Bolke de Bruin created AIRFLOW-20:

             Summary: Align start_date with the schedule_interval
                 Key: AIRFLOW-20
             Project: Apache Airflow
          Issue Type: Improvement
            Reporter: Bolke de Bruin

The need to align the start_date with the schedule_interval is counterintuitive
and leads to a lot of questions and issue creation, even though it is documented.
If we can fix this with little or no impact on current setups, that should be
preferred, I think.
The dependency explainer is really great work, but it doesn’t address the core issue.

If you consider a DAG a description of cohesion between work items (in Java OOP terms,
a class), then a DagRun is the instantiation of a DAG in time (in Java OOP terms, an instance).

Tasks are then the description of a work item and a TaskInstance the instantiation of the
Task in time.

In my opinion, issues pop up because the current paradigm considers the TaskInstance
the smallest unit of work and asks it to maintain its own state in relation to other
TaskInstances in the same DagRun and in a previous DagRun, of which it has no (real)
perception. Tasks are instantiated by a Cartesian product with the dates of the DagRuns
instead of with the DagRuns themselves.
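
As a rough illustration of that difference, here is a minimal sketch. The classes and task names are hypothetical simplifications, not Airflow's actual models or scheduler code. The first list pairs every task with every date, so the resulting instances know nothing about a DagRun; the second builds instances from DagRun objects, so each one can reach its run.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import product
from typing import Optional


@dataclass
class DagRun:
    # Hypothetical simplified model, not Airflow's real DagRun class.
    execution_date: datetime


@dataclass
class TaskInstance:
    # Hypothetical simplified model, not Airflow's real TaskInstance class.
    task_id: str
    execution_date: datetime
    dag_run: Optional[DagRun] = None  # None: the task runs without a DagRun


task_ids = ["extract", "load"]  # made-up task names
dates = [datetime(2016, 4, 29) + timedelta(days=i) for i in range(2)]

# Current paradigm: Cartesian product of tasks and dates, no link to a DagRun.
by_date = [TaskInstance(t, d) for t, d in product(task_ids, dates)]

# Proposed direction: instantiate from the DagRuns themselves.
runs = [DagRun(d) for d in dates]
by_run = [
    TaskInstance(t, run.execution_date, dag_run=run)
    for run in runs
    for t in task_ids
]
```

Both lists hold the same four (task, date) pairs; only the second lets a TaskInstance ask which DagRun it belongs to.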

The very loose coupling between DagRuns and TaskInstances can be improved while maintaining
flexibility to run tasks without a DagRun. This would help with a couple of things:

1. start_date can be used as an ‘execution_date’ or as a point in time from which to start looking
2. a new interval for a dag will maintain depends_on_past
3. paused dags do not give trouble
4. tasks will be executed in order 
5. the ignore_first_depend_on_past could be removed, as a task will now know whether it
really is the first

In PR-1431 a lot of this work has been done by:

1. Adding a “previous” field to a DagRun allowing it to connect to its predecessor
2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the DagRun if needed
3. Using start_date + interval as the first run date, unless start_date is on the interval,
in which case start_date itself is the first run date
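
The three changes above could be sketched roughly as follows. This is not the code from the PR: the classes are hypothetical, the interval is assumed to be a fixed timedelta, and "on the interval" is assumed to mean that start_date is a whole number of intervals away from an arbitrary anchor (here the Unix epoch). The rule in item 3 is implemented exactly as worded.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed anchor for deciding whether a date is "on the interval".
EPOCH = datetime(1970, 1, 1)


@dataclass
class DagRun:
    # Hypothetical model; field names follow the list above, not real Airflow code.
    id: int
    execution_date: datetime
    previous: Optional["DagRun"] = None  # item 1: link to the predecessor


@dataclass
class TaskInstance:
    # Hypothetical model.
    task_id: str
    dag_run_id: Optional[int] = None  # item 2: None when run without a DagRun


def first_run_date(start_date: datetime, interval: timedelta) -> datetime:
    """Item 3 as worded: start_date if aligned, otherwise start_date + interval."""
    aligned = (start_date - EPOCH) % interval == timedelta(0)
    return start_date if aligned else start_date + interval


def is_first_run(run: DagRun) -> bool:
    # With a predecessor link, "really the first" becomes a simple check.
    return run.previous is None


daily = timedelta(days=1)
first = DagRun(id=1, execution_date=first_run_date(datetime(2016, 4, 29), daily))
second = DagRun(id=2, execution_date=first.execution_date + daily, previous=first)
ti = TaskInstance(task_id="load", dag_run_id=second.id)
```

Under these assumptions, a start_date of midnight on 2016-04-29 with a daily interval is already aligned and becomes the first run date, while a start_date of 13:59 on the same day yields a first run one interval later.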

