airflow-dev mailing list archives

From Maxime Beauchemin <maximebeauche...@gmail.com>
Subject Re: airflow start_date confusion:
Date Sun, 12 Jun 2016 23:42:22 GMT
From: http://pythonhosted.org/airflow/faq.html

*What’s the deal with ``start_date``?*

start_date is partly legacy from the pre-DagRun era, but it is still
relevant in many ways. When creating a new DAG, you probably want to set a
global start_date for your tasks using default_args. The first DagRun to be
created will be based on the min(start_date) for all your tasks. From that
point on, the scheduler creates new DagRuns based on your schedule_interval and
the corresponding task instances run as your dependencies are met. When
introducing new tasks to your DAG, you need to pay special attention to
start_date, and may want to reactivate inactive DagRuns so that the new task
is onboarded properly.
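
To make the paragraph above concrete, here is a minimal sketch in plain Python (no Airflow import). The helper `first_runs` is hypothetical and only models the rule described here: the first run is anchored at min(start_date) across tasks, later runs follow the schedule_interval, and each run only becomes eligible once its period has closed. The real scheduler does much more than this.

```python
from datetime import datetime, timedelta


def first_runs(task_start_dates, interval, now):
    """Hypothetical model: (execution_date, eligible_at) pairs up to `now`."""
    # The first DagRun is based on the earliest start_date of any task.
    execution_date = min(task_start_dates)
    runs = []
    # A run covers [execution_date, execution_date + interval) and is
    # only eligible once that period has closed.
    while execution_date + interval <= now:
        runs.append((execution_date, execution_date + interval))
        execution_date += interval
    return runs


# Two tasks with different start dates; the earlier one anchors the DAG.
starts = [datetime(2016, 6, 12, 0), datetime(2016, 6, 12, 2)]
runs = first_runs(starts, timedelta(hours=1), now=datetime(2016, 6, 12, 3, 30))
for execution_date, eligible_at in runs:
    print(execution_date, "becomes eligible at", eligible_at)
```

Under this model, the run that processes the 02:00 to 03:00 data carries the execution date 02:00 and starts at 03:00 or later, which is probably why an hourly pipeline can look like it is one hour late.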

We recommend against using dynamic values as start_date, especially
datetime.now() as it can be quite confusing. The task is triggered once the
period closes, and in theory an @hourly DAG would never get to an hour
after now as now() moves along.
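
A small sketch of why a moving start_date never fires, assuming the start_date expression is re-evaluated every time the DAG file is loaded. The function `period_closed` and the simulated clock are illustrative only and are not part of Airflow.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(hours=1)


def period_closed(start_date, now, interval=INTERVAL):
    """True once the first period after start_date has fully elapsed."""
    return start_date + interval <= now


fixed_start = datetime(2016, 6, 12, 0, 0)
clock = datetime(2016, 6, 12, 0, 0)
fixed_results, dynamic_results = [], []

# Advance a simulated clock in 30-minute steps.
for _ in range(5):
    clock += timedelta(minutes=30)
    # Stand-in for datetime.now() being evaluated again at each load.
    dynamic_start = clock
    fixed_results.append(period_closed(fixed_start, clock))
    dynamic_results.append(period_closed(dynamic_start, clock))

print("fixed start_date:  ", fixed_results)
print("dynamic start_date:", dynamic_results)
```

With the fixed start_date the first period closes after one hour. With the dynamic one, start_date + interval is always one hour ahead of the clock, so the period never closes.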

We also recommend using rounded start_date in relation to your
schedule_interval. This means an @hourly job would be at 00:00 minutes:seconds,
a @daily job at midnight, a @monthly job on the first of the month. You can
use any sensor or a TimeDeltaSensor to delay the execution of tasks within
that period. While schedule_interval does allow specifying a
datetime.timedelta object, we recommend using the macros or cron
expressions instead, as they enforce this idea of rounded schedules.
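
As an illustration of what "rounded" means for each preset, here is a hedged sketch. The helper `round_down` is hypothetical and not an Airflow function. It takes a fixed datetime and truncates it to the boundary that matches the schedule.

```python
from datetime import datetime


def round_down(dt, schedule):
    """Hypothetical helper: truncate dt to the boundary of the schedule."""
    if schedule == "@hourly":
        return dt.replace(minute=0, second=0, microsecond=0)
    if schedule == "@daily":
        return dt.replace(hour=0, minute=0, second=0, microsecond=0)
    if schedule == "@monthly":
        return dt.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError("unsupported schedule: %r" % schedule)


# A fixed, hand-picked moment (not now()), rounded three ways.
moment = datetime(2016, 6, 12, 23, 42, 22)
hourly_start = round_down(moment, "@hourly")
daily_start = round_down(moment, "@daily")
monthly_start = round_down(moment, "@monthly")
print(hourly_start, daily_start, monthly_start)
```

In a DAG file it is usually simpler to write the rounded value as a literal, for example datetime(2016, 6, 12), so that the value never changes between loads.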

When using depends_on_past=True it’s important to pay special attention to
start_date, as the past dependency is waived only for the one schedule that
matches the start_date specified for the task. It’s also important to
watch DagRun activity status over time when introducing new
depends_on_past=True tasks, unless you are planning on running a backfill for
the new task(s).
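
The rule above can be sketched as follows. This is a simplified model and not Airflow's actual dependency check. The function `can_run` and the `succeeded` set are assumptions made for illustration.

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)


def can_run(execution_date, task_start_date, succeeded, interval=HOUR):
    """Simplified depends_on_past=True check for one task instance."""
    # The past dependency is waived only for the schedule that matches
    # the task's own start_date.
    if execution_date == task_start_date:
        return True
    # Otherwise the previous schedule must already have succeeded.
    return (execution_date - interval) in succeeded


first_schedule = datetime(2016, 6, 12, 0, 0)
second_schedule = first_schedule + HOUR

aligned = can_run(first_schedule, first_schedule, succeeded=set())
blocked = can_run(second_schedule, first_schedule, succeeded=set())
unblocked = can_run(second_schedule, first_schedule, succeeded={first_schedule})

# A start_date that falls between schedules never matches an execution
# date, so in this model the task stays blocked with no prior success.
misaligned = can_run(second_schedule, datetime(2016, 6, 12, 0, 15), succeeded=set())
print(aligned, blocked, unblocked, misaligned)
```

The last case is the one to watch for when adding a new depends_on_past=True task to an existing DAG. A start_date that does not line up with a schedule leaves the task with nothing to waive the dependency, unless a backfill is run.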

Also important to note is that a task’s start_date, in the context of a
backfill CLI command, gets overridden by the backfill command’s start_date.
This allows a backfill on tasks that have depends_on_past=True to
actually start; if that weren’t the case, the backfill just wouldn’t start.

On Sun, Jun 12, 2016 at 3:17 PM, harish singh <harish.singh22@gmail.com>
wrote:

> These are the default args to my DAG.
> I am trying to run a standard hourly job (basically, at the end of
> this hour, process last hours data)
> I noticed that my pipeline is 1 hour late.
>
> For some reason, I am messing up with my start_date I guess.
> What is the best practice for setting up start_date?
>
>
> scheduling_start_date = (datetime.utcnow()).replace(
>     minute=0, second=0, microsecond=0) + datetime.timedelta(minutes=15)
> default_schedule_interval = datetime.timedelta(minutes=60)
> default_args = {
>
>     'owner': 'airflow',
>     'depends_on_past': False,
>     'start_date': scheduling_start_date,
>     'email': ['airflow@airflow.com'],
>     'email_on_failure': False,
>     'email_on_retry': False,
>     'retries': 2,
>     'retry_delay': default_retries_delay,
>     'schedule_interval': default_schedule_interval,
>
>     # 'queue': 'bash_queue',
>     # 'pool': 'backfill',
>     # 'priority_weight': 10,
>     # 'end_date': datetime(2016, 1, 1),
> }
>
