airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Meickle <jmeic...@quantopian.com.INVALID>
Subject Re: [DISCUSS] AIRFLOW-4192 - remove duplicate/obsolete/derived task context variables
Date Tue, 09 Apr 2019 14:57:43 GMT
I agree with Ash here. The naming of "execution_date" is incredibly
confusing to people who are new to Airflow, who think it has something to
do with... execution.

However, I think that there's still room for improvement with
"period_start" and "period_end". Think about manually triggered tasks -
they'd have a "period_start", but no useful "period_end" until the next
trigger occurs. And mid-day ETL tasks that are dated to "today" still have
to reference "period_end" to get the correct date, even though the DAGrun
won't be over yet!

If we're going to go through and rename "execution_date", I'd rather see a
wider-ranging rework that understands a mapping from a date to a window
(like "every day, process the data from one week ago", or "every Saturday,
process today's data"). Maybe something closer to how Beam does it
(but retaining simplicity for the 99% of cases that are just "run daily,
for the past day").
https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions

I could see something where there are two default DAG parameters are:
schedule="@daily"
window=-1 (or some schedule_delta object that provides a windowing impl, or
None if there is no window)

In this world, all DAGRuns would have a "schedule_date" (the date this
would have normally started at), and an "execution_date" (when it was
actually executed). If a given DAGrun has a window (one-off DAGs may or may
not!), there would be variables for "window_start" (the start of the
window) and "window_end" (the end of the window; this would default to the
same as schedule_date, and to the next DAGrun's window_start).

Disconnecting schedule date, execution date, and window/period would also
open a pathway for more advanced users to implement irregular schedules and
windows, without having to rely on hacks. For example: a DAG that runs at
"the end of the week" to produce a end-of-week would normally run on
Friday, but on Friday holidays, instead complete a day early (and scan one
fewer day in its window).

I know there's some historical baggage here, but I think the above is much
more natural to new users than what we're offering today. And I think from
a current user perspective, there would be breaking variable name changes,
but no logical differences for the majority of DAGs.

On Tue, Apr 9, 2019 at 6:17 AM Ash Berlin-Taylor <ash@apache.org> wrote:

> To (slightly) hijack this thread:
>
> On the subject of execuction_date: as I'm sure we're all aware the concept
> of execution_date is confusing to new-commers to Airflow (there are many
> questions about "why hasn't my DAG run yet"? "Why is my dag a day behind?"
> etc.) and although we mention this in the docs it's a confusing concept.
>
> What to people think about adding two new parameters: `period_start` and
> `period_end` and making these the preferred terms in place of
> execution_date and next_execution_date?
>
> This hopefully avoids any ambitious terms like "execution" or "run" which
> is understandably easy to conflate with the time the task is being run
> (i.e. `now()`)
>
> If people think this naming is better and less confusing I would suggest
> we update all the docs and examples to use these terms (but still mention
> the old names somewhere, probably in the macros docs)
>
> Thoughts?
>
> -ash
>
>
> > On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wiedmer@gmail.com>
> wrote:
> >
> > Hi Bas,
> >
> > 1) I am aware of a few places where those parameters are used in
> production
> > in a few hundred jobs. I highly recommend we don't deprecate them unless
> we
> > do it in a major version.
> >
> > 2) As James mentioned, inlets and outlets are a lineage annotation
> feature
> > which is still under development. Let's leave them in, but we can guard
> > them behind a feature flag if you prefer.
> >
> > 3) the yesterday*/tomorrow* params are convenience ones if you use a
> daily
> > ETL. I agree with you that they are simple to compute, but not everyone
> > using Apache Airflow is amazing with Python. Some users are only trying
> to
> > get a query scheduled and rely on a couple of niceties like these to get
> by.
> >
> > 4) latest_date, end_date (I feel like there used to be start_date, but
> > maybe it got lost) were a blend of things which were used by a backfill
> > framework used internally at Airbnb. Latest date was used if you needed
> to
> > join to a dimension for which you only wanted the latest version of the
> > attributes in you backfill. end_date was used for time ranges where
> several
> > days were processed together in a range to save on compute. I don't see
> an
> > issue with removing them.
> >
> > Best regards,
> > Arthur
> >
> >
> >
> > On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak <
> basharenslak@godatadriven.com>
> > wrote:
> >
> >> Hi all,
> >>
> >> Following Tao Feng’s question to discuss this PR<
> >> https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192<
> >> https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss
> here
> >> if you agree/disagree/would change.
> >>
> >> -----------
> >>
> >> The summary of the PR:
> >>
> >> I was confused by the task context values and suggest to clean up and
> >> clarify these variables. Some are derivations from other variables, some
> >> are undocumented and unused, some are wrong (name doesn’t match the
> value).
> >> Please discuss what you think of the removal of these variables:
> >>
> >>
> >>  *   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> >> tomorrow_ds_nodash. IMO the next_* and previous_* variables are useful
> >> since these require complex logic to compute the next execution date,
> >> however would leave computing the yesterday* and tomorrow* variables up
> to
> >> the user since they are simple one-liners and don't relate to the DAG
> >> interval.
> >>  *   Removed tables. This is a field in params, and is thus also
> >> accessible by the user ({{ params.tables }}). Also, it was undocumented.
> >>  *   Removed latest_date. It's the same as ds and was also undocumented.
> >>  *   Removed inlets and outlets. Also undocumented, and have the
> >> inlets/outlets ever worked/ever been used by anybody?
> >>  *   Removed end_date and END_DATE. Both have the same value, so it
> >> doesn't make sense to have both variables. Also, the value is ds which
> >> contains the start date of the interval, so the naming didn't make
> sense to
> >> me. However, if anybody argues in favour of adding "start_date" and
> >> "end_date" to provide the start and end datetime of task instance
> >> intervals, I'd be happy to add them.
> >>
> >> Cheers,
> >> Bas
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message