airflow-commits mailing list archives

From "Chris Riccomini (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AIRFLOW-20) Improving the scheduler by making dag runs more coherent
Date Tue, 03 May 2016 15:29:12 GMT

     [ https://issues.apache.org/jira/browse/AIRFLOW-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Riccomini updated AIRFLOW-20:
-----------------------------------
    Component/s: scheduler

> Improving the scheduler by making dag runs more coherent
> --------------------------------------------------------
>
>                 Key: AIRFLOW-20
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-20
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: scheduler
>            Reporter: Bolke de Bruin
>              Labels: backfill, database, scheduler
>
> The need to align the start_date with the interval is counterintuitive
> and leads to a lot of questions and issue creation, although it is in the documentation.
> If we are able to fix this with little or no consequence for current setups, that should
> be preferred, I think.
> The dependency explainer is really great work, but it doesn’t address the core issue.
> If you consider a DAG a description of cohesion between work items (in OOP Java terms
> a class), then a DagRun is the instantiation of a DAG in time (in OOP Java terms an instance).
> Tasks are then the description of a work item and a TaskInstance the instantiation of
> the Task in time.
> In my opinion issues pop up due to the current paradigm of considering the TaskInstance
> the smallest unit of work and asking it to maintain its own state in relation to other
> TaskInstances in a DagRun and in a previous DagRun of which it has no (real) perception.
> Tasks are instantiated by a Cartesian product with the dates of DagRuns instead of with
> the DagRuns themselves.
> The very loose coupling between DagRuns and TaskInstances can be improved while maintaining
> flexibility to run tasks without a DagRun. This would help with a couple of things:
> 1. start_date can be used as an ‘execution_date’ or a point in time when to start looking
> 2. a new interval for a dag will maintain depends_on_past
> 3. paused dags do not give trouble
> 4. tasks will be executed in order
> 5. the ignore_first_depend_on_past could be removed as a task will now know if it is
> really the first
> In PR-1431 a lot of this work has been done by:
> 1. Adding a “previous” field to a DagRun allowing it to connect to its predecessor
> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the DagRun if needed
> 3. Using start_date + interval as the first run date unless start_date is on the interval,
> in which case start_date is the first run date
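
The three changes listed for PR-1431 can be sketched roughly as follows. This is
only an illustration of the idea, not the code in the PR: the class layouts, the
helper names (first_run_date, is_first_in_history) and the EPOCH anchor are all
hypothetical, and "on the interval" is assumed here to mean a whole number of
intervals since that anchor. A TaskInstance without a DagRun is assumed to have
no predecessor.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional

# Assumed alignment anchor; the issue does not say how "on the interval" is tested.
EPOCH = datetime(1970, 1, 1)


def first_run_date(start_date: datetime, interval: timedelta) -> datetime:
    """Item 3: start_date itself if it is on the interval, else start_date + interval."""
    on_interval = (start_date - EPOCH) % interval == timedelta(0)
    return start_date if on_interval else start_date + interval


@dataclass
class DagRun:
    id: int
    execution_date: datetime
    # Item 1: link to the predecessor run; None marks the first run.
    previous: Optional["DagRun"] = None


@dataclass
class TaskInstance:
    task_id: str
    # Item 2: optional, so a task can still be run without a DagRun.
    dag_run_id: Optional[int] = None


def is_first_in_history(ti: TaskInstance, runs: Dict[int, DagRun]) -> bool:
    """True if the TaskInstance's DagRun has no predecessor (or it has no DagRun)."""
    if ti.dag_run_id is None:
        return True  # assumption: a standalone task has nothing to depend on
    return runs[ti.dag_run_id].previous is None
```

With a daily interval, a start_date of 2016-05-03 00:00 is used directly as the
first run date, while 2016-05-03 15:29 yields 2016-05-04 15:29. Because each
DagRun points at its predecessor, a task can tell whether it is really the first
without a separate ignore_first_depend_on_past switch.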



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
