airflow-dev mailing list archives

From "Shaw, Damian P."
Subject RE: [DISCUSS] period_start/period_end instead of execution_date/next_execution_date
Date Tue, 09 Apr 2019 14:31:04 GMT
Hi all,

I'm new to this Airflow dev mailing list, so I wasn't expecting to reply to anything, but I
feel I am the target audience for this question. I am quite new to Airflow and have been setting
up an Airflow environment for my business this last month.

I find the current "execution_date" a small technical burden and a large cognitive burden.
Our workflow is based on DAGs running at a specified time in a specified timezone using the
same date as the current calendar date.

I have worked around this by creating my own macro and context variables, with the logic looking
like this:
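        import datetime

        # execution_date marks the start of the schedule interval
        airflow_execution_date = context['execution_date']
        dag_timezone = context['dag'].timezone
        local_execution_date = dag_timezone.convert(airflow_execution_date)
        # shift by one daily interval to get the calendar date the run fires on
        local_cal_date = local_execution_date + datetime.timedelta(days=1)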

As you can see this isn't a lot of technical effort, but having a date that 1) is in the timezone
the business users are working in, and 2) is the same calendar date the business users are
working in significantly reduces the cognitive effort required to set up tasks. Of course
this doesn't help with cron-format scheduling: I just let the business give me the requirements
and set it up myself, as the date logic there is still confusing because it doesn't work like
real cron scheduling, which everyone is familiar with.
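
A minimal, standalone sketch of the workaround above, using only the standard library so it can be tried outside Airflow. The function name `local_calendar_date` and the fixed-offset timezone are hypothetical stand-ins for the macro and for `context['dag'].timezone`; it assumes a fixed-length (daily by default) schedule interval.

```python
import datetime


def local_calendar_date(execution_date, dag_timezone, interval=datetime.timedelta(days=1)):
    """Calendar date, in the DAG's timezone, on which a run is expected to fire.

    Assumes a fixed-length schedule interval (daily by default).
    """
    if execution_date.tzinfo is None:
        raise ValueError("execution_date must be timezone-aware")
    # Convert to the business timezone first, then step to the end of the interval.
    local_execution_date = execution_date.astimezone(dag_timezone)
    return (local_execution_date + interval).date()


# Hypothetical stand-in for the DAG's timezone: a fixed UTC-4 offset.
new_york_edt = datetime.timezone(datetime.timedelta(hours=-4), "EDT")
execution_date = datetime.datetime(2019, 4, 8, 22, 0, tzinfo=datetime.timezone.utc)
print(local_calendar_date(execution_date, new_york_edt))  # 2019-04-09
```

A real deployment would pass the DAG's own timezone object rather than a fixed offset, so that daylight saving changes are handled.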

Maybe "period_start" and "period_end" might help people on Day 0 of understanding Airflow
get that the dates you are dealing with are not what you expect, but on Day 1+ there's still
a lot of cognitive overhead if you don't have the exact same model as Airbnb for running DAGs
and tasks.

My 2 cents anyway,
Damian Shaw

-----Original Message-----
From: Ash Berlin-Taylor
Sent: Tuesday, April 09, 2019 10:08 AM
Subject: [DISCUSS] period_start/period_end instead of execution_date/next_execution_date 

(trying to break this out in to another thread)

The ML doesn't allow images, but I can guess that it is the deps section of a task instance
details screen?

I'm not saying it's not clear once you know to look there, but I'm trying to remove/reduce the
confusion in the first place. And I think we as committers aren't best placed to know what
makes sense as we have internalised how Airflow works :)

So I guess this is a question to the newest people on the list: Would `period_start` and `period_end`
be more or less confusing for you when you were first getting started with Airflow?
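
For readers following the naming question, here is a rough sketch of how the proposed names would line up with the existing ones. The `Period` type and `period_for` helper are hypothetical and not part of Airflow; the sketch assumes a fixed-length schedule interval rather than a cron expression.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class Period(NamedTuple):
    period_start: datetime  # what Airflow calls execution_date
    period_end: datetime    # what Airflow calls next_execution_date


def period_for(execution_date: datetime, schedule_interval: timedelta) -> Period:
    """Map an execution_date onto the proposed names (fixed-length interval assumed)."""
    return Period(execution_date, execution_date + schedule_interval)


start = datetime(2019, 4, 8, tzinfo=timezone.utc)
daily = period_for(start, timedelta(days=1))
hourly = period_for(start, timedelta(hours=1))

# A run covers [period_start, period_end) and can only begin once period_end has passed.
print(daily.period_start.isoformat(), "->", daily.period_end.isoformat())
print(hourly.period_start.isoformat(), "->", hourly.period_end.isoformat())
```

Switching the schedule from daily to hourly changes only the interval passed in; code that reads `period_end` stays the same.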


> On 9 Apr 2019, at 14:47, Driesprong, Fokko wrote:
> Ash,
> Personally, I think this is quite clear, there is a list of reasons why the job isn't
being scheduled:
> Coming back to Bas's question, I believe that yesterday_ds does not make sense since
we cannot assume that the schedule is daily. I don't see any usage of this variable. Personally,
I do use next_execution_date quite extensively. Suppose you have a job that runs daily, but you
want to change it to an hourly job. In such a case you don't want to change {{ (execution_date
+ macros.timedelta(days=1)) }} to {{ (execution_date + macros.timedelta(hours=1)) }} everywhere.
> I'm just not sure if the aggressive deprecation is really worth it. I don't see too
much harm in letting them stay.
> Cheers, Fokko 
> On Tue 9 Apr 2019 at 12:17, Ash Berlin-Taylor wrote:
> To (slightly) hijack this thread:
> On the subject of execution_date: as I'm sure we're all aware, the concept of execution_date
is confusing to newcomers to Airflow (there are many questions about "why hasn't my DAG
run yet?", "why is my DAG a day behind?" etc.) and although we mention this in the docs it's
a confusing concept.
> What do people think about adding two new parameters: `period_start` and `period_end`
and making these the preferred terms in place of execution_date and next_execution_date?
> This hopefully avoids any ambiguous terms like "execution" or "run", which are understandably
easy to conflate with the time the task is being run (i.e. `now()`).
> If people think this naming is better and less confusing I would suggest we update all
the docs and examples to use these terms (but still mention the old names somewhere, probably
in the macros docs)
> Thoughts?
> -ash
> > On 8 Apr 2019, at 16:20, Arthur Wiedmer wrote:
> > 
> > Hi Bas,
> > 
> > 1) I am aware of a few places where those parameters are used in production
> > in a few hundred jobs. I highly recommend we don't deprecate them unless we
> > do it in a major version.
> > 
> > 2) As James mentioned, inlets and outlets are a lineage annotation feature
> > which is still under development. Let's leave them in, but we can guard
> > them behind a feature flag if you prefer.
> > 
> > 3) the yesterday*/tomorrow* params are convenience ones if you use a daily
> > ETL. I agree with you that they are simple to compute, but not everyone
> > using Apache Airflow is amazing with Python. Some users are only trying to
> > get a query scheduled and rely on a couple of niceties like these to get by.
> > 
> > 4) latest_date, end_date (I feel like there used to be start_date, but
> > maybe it got lost) were a blend of things which were used by a backfill
> > framework used internally at Airbnb. Latest date was used if you needed to
> > join to a dimension for which you only wanted the latest version of the
> > attributes in your backfill. end_date was used for time ranges where several
> > days were processed together in a range to save on compute. I don't see an
> > issue with removing them.
> > 
> > Best regards,
> > Arthur
> > 
> > 
> > 
> > On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak wrote:
> > 
> >> Hi all,
> >> 
> >> Following Tao Feng’s question to discuss this PR, please discuss here
> >> if you agree/disagree/would change.
> >> 
> >> -----------
> >> 
> >> The summary of the PR:
> >> 
> >> I was confused by the task context values and suggest cleaning up and
> >> clarifying these variables. Some are derivations from other variables, some
> >> are undocumented and unused, some are wrong (name doesn’t match the value).
> >> Please discuss what you think of the removal of these variables:
> >> 
> >> 
> >>  *   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> >> tomorrow_ds_nodash. IMO the next_* and previous_* variables are useful
> >> since these require complex logic to compute the next execution date,
> >> however would leave computing the yesterday* and tomorrow* variables up to
> >> the user since they are simple one-liners and don't relate to the DAG
> >> interval.
> >>  *   Removed tables. This is a field in params, and is thus also
> >> accessible by the user ({{ params.tables }}). Also, it was undocumented.
> >>  *   Removed latest_date. It's the same as ds and was also undocumented.
> >>  *   Removed inlets and outlets. Also undocumented, and have the
> >> inlets/outlets ever worked/ever been used by anybody?
> >>  *   Removed end_date and END_DATE. Both have the same value, so it
> >> doesn't make sense to have both variables. Also, the value is ds which
> >> contains the start date of the interval, so the naming didn't make sense to
> >> me. However, if anybody argues in favour of adding "start_date" and
> >> "end_date" to provide the start and end datetime of task instance
> >> intervals, I'd be happy to add them.
> >> 
> >> Cheers,
> >> Bas
> >> 
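
For reference, the yesterday*/tomorrow* values debated in the quoted thread can be derived from `ds` roughly as follows. This is a plain-Python sketch run outside Airflow; `shift_ds` is a hypothetical helper name, and the sample `ds` value is made up for illustration.

```python
from datetime import date, timedelta


def shift_ds(ds: str, days: int) -> str:
    """Shift a YYYY-MM-DD date stamp by a whole number of days."""
    return (date.fromisoformat(ds) + timedelta(days=days)).isoformat()


ds = "2019-04-08"  # sample value, in the format Airflow uses for ds
yesterday_ds = shift_ds(ds, -1)
tomorrow_ds = shift_ds(ds, 1)
yesterday_ds_nodash = yesterday_ds.replace("-", "")
tomorrow_ds_nodash = tomorrow_ds.replace("-", "")

print(yesterday_ds, tomorrow_ds, yesterday_ds_nodash, tomorrow_ds_nodash)
```

Inside a template, Airflow's existing `macros.ds_add(ds, -1)` covers the same case without a custom helper. Note that these values are always one calendar day away from `ds`, regardless of the DAG's schedule interval.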
