airflow-commits mailing list archives

From "Jin Mingjian (JIRA)" <>
Subject [jira] [Assigned] (AIRFLOW-20) Improving the scheduler by making dag runs more coherent
Date Wed, 06 Sep 2017 11:31:00 GMT


Jin Mingjian reassigned AIRFLOW-20:

    Assignee: Jin Mingjian

> Improving the scheduler by making dag runs more coherent
> --------------------------------------------------------
>                 Key: AIRFLOW-20
>                 URL:
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: scheduler
>            Reporter: Bolke de Bruin
>            Assignee: Jin Mingjian
>              Labels: backfill, database, scheduler
> The need to align the start_date with the interval is counterintuitive
> and leads to a lot of questions and issue creation, although it is in the documentation.
> If we are able to fix this with little or no consequence for current setups, that should
> be preferred, I think.
> The dependency explainer is really great work, but it doesn’t address the core issue.
> If you consider a DAG a description of cohesion between work items (in OOP Java terms
> a class), then a DagRun is the instantiation of a DAG in time (in OOP Java terms an instance).
> Tasks are then the description of a work item and a TaskInstance the instantiation of
> the Task in time.
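
The class/instance analogy above can be sketched in Python roughly as follows. This is an illustrative sketch only: the class and field names below are hypothetical and are not Airflow's actual models.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass(frozen=True)
class Task:
    """Description of a single work item (hypothetical stand-in, not Airflow's class)."""
    task_id: str


@dataclass(frozen=True)
class Dag:
    """Description of how work items belong together (the 'class' in the analogy)."""
    dag_id: str
    tasks: Tuple[Task, ...]


@dataclass(frozen=True)
class TaskInstance:
    """A Task instantiated at a point in time."""
    task: Task
    execution_date: datetime


@dataclass(frozen=True)
class DagRun:
    """A Dag instantiated at a point in time (the 'instance' in the analogy)."""
    dag: Dag
    execution_date: datetime

    def task_instances(self) -> List[TaskInstance]:
        # Each task of the DAG is instantiated for this run's execution date.
        return [TaskInstance(task, self.execution_date) for task in self.dag.tasks]
```

For example, `DagRun(Dag("example", (Task("extract"), Task("load"))), datetime(2017, 9, 6)).task_instances()` yields one TaskInstance per Task, all sharing the run's execution date.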
> In my opinion issues pop up due to the current paradigm of considering the TaskInstance
> the smallest unit of work and asking it to maintain its own state in relation to other
> TaskInstances in a DagRun and in a previous DagRun of which it has no (real) perception.
> Tasks are instantiated by a Cartesian product with the dates of DagRuns instead of the
> DagRuns themselves.
> The very loose coupling between DagRuns and TaskInstances can be improved while maintaining
> flexibility to run tasks without a DagRun. This would help with a couple of things:
> 1. start_date can be used as an ‘execution_date’ or as the point in time at which to start
> 2. a new interval for a dag will maintain depends_on_past
> 3. paused dags do not give trouble
> 4. tasks will be executed in order
> 5. the ignore_first_depend_on_past could be removed as a task will now know if it is
> really the first
> In PR-1431 a lot of this work has been done by:
> 1. Adding a “previous” field to a DagRun allowing it to connect to its predecessor
> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the DagRun if needed
> 3. Using start_date + interval as the first run date, unless start_date is on the interval,
> in which case start_date is the first run date
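
The three changes listed above might look roughly like the following Python sketch. It is not the code from PR-1431. The names `DagRunSketch`, `TaskInstanceSketch`, `first_run_date`, and `EPOCH` are hypothetical, and the sketch assumes a fixed `timedelta` interval whose boundaries are counted from the Unix epoch, which the issue does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: interval boundaries are measured from this anchor.
EPOCH = datetime(1970, 1, 1)


def first_run_date(start_date: datetime, interval: timedelta) -> datetime:
    """Return start_date if it lies on an interval boundary, else start_date + interval."""
    on_interval = (start_date - EPOCH) % interval == timedelta(0)
    return start_date if on_interval else start_date + interval


@dataclass
class DagRunSketch:
    """Hypothetical DagRun carrying a link to its predecessor."""
    run_id: int
    execution_date: datetime
    previous: Optional["DagRunSketch"] = None

    def is_first(self) -> bool:
        # With the link in place, a run can tell whether it really is the first.
        return self.previous is None


@dataclass
class TaskInstanceSketch:
    """Hypothetical TaskInstance that may or may not belong to a DagRun."""
    task_id: str
    dag_run_id: Optional[int] = None  # None models a task run without a DagRun
```

Under these assumptions, a start_date of midnight with a daily interval is itself the first run date, while a start_date of 11:31 yields a first run one interval later.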

This message was sent by Atlassian JIRA
