airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bas Harenslak <basharens...@godatadriven.com>
Subject Re: [DISCUSS] AIRFLOW-4192 - remove duplicate/obsolete/derived task context variables
Date Thu, 11 Apr 2019 06:44:22 GMT
Great discussion, let’s stay on track. If I can summarise:


  *   yesterday_ds, yesterday_ds_nodash, tomorrow_ds, tomorrow_ds_nodash
     *   Arthur: some users use these for convenience
     *   Bas/Fokko: these are values that can be easily derived in a one-liner

  *   tables
     *   nobody?

  *   latest_date
     *   Arthur: used by internal backfill framework at Airbnb, but no issue removing them.

  *   inlets/outlets
     *   Arthur: still in development, could be guarded behind feature flag.

  *   end_date/END_DATE
     *   Arthur: used by internal backfill framework at Airbnb, but no issue removing them.

And to add to the discussion:


  *   execution_date
     *   The meaning of execution_date is ambivalent and super confusing to everybody because
it’s not the date of execution.
     *   Suggestions to add period_start/period_end, or interval_start/interval_end, or something
like that, to provide datetime variables for start & end datetime of a DAG run.
  *   Other:
     *   Removal of variables should be done in major version and deprecation warnings should
be added.

So how about the following:
- We start by putting deprecation warnings on tables, latest_date, end_date and END_DATE and
remove them in Airflow 2.0.
- We add a lineage_enabled config which is false by default and thus inlets/outlets aren’t
provided, unless set to true.
- We continue discussion about yesterday* and tomorrow*
- We continue discussion about clarification of execution_date and better named variables
such as something_start and something_end

Cheers,
Bas


On 11 Apr 2019, at 04:52, Maxime Beauchemin <maximebeauchemin@gmail.com<mailto:maximebeauchemin@gmail.com>>
wrote:

Making backwards incompatible changes that require altering the thousands
(millions?!) of DAGs in the wild will alienate the community and prevent
many from orchestrating an upgrade. Upgrading hundreds of DAGs and Airflow
atomically would be hard and dangerous.

To mitigate this, changes to the DAG / core API should be backward
compatible, at least for a release range, and alert on upcoming deprecation
(if needed)

To be clear if `execution_date` is renamed to `foo`, we should have a
version range where both work just as well, but using `execution_date`
would result is warning messages about its upcoming deprecation, maybe
something like "It appears you're using the task parameter `execution_date`
which will be deprecated in favor or `foo` in upcoming release 2.0.0, find
more info at link.to/foo<http://link.to/foo>"

Max

On Tue, Apr 9, 2019 at 7:58 AM James Meickle
<jmeickle@quantopian.com.invalid<mailto:jmeickle@quantopian.com.invalid>> wrote:

I agree with Ash here. The naming of "execution_date" is incredibly
confusing to people who are new to Airflow, who think it has something to
do with... execution.

However, I think that there's still room for improvement with
"period_start" and "period_end". Think about manually triggered tasks -
they'd have a "period_start", but no useful "period_end" until the next
trigger occurs. And mid-day ETL tasks that are dated to "today" still have
to reference "period_end" to get the correct date, even though the DAGrun
won't be over yet!

If we're going to go through and rename "execution_date", I'd rather see a
wider-ranging rework that understands a mapping from a date to a window
(like "every day, process the data from one week ago", or "every Saturday,
process today's data"). Maybe something closer to how Beam does it
(but retaining simplicity for the 99% of cases that are just "run daily,
for the past day").

https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions

I could see something where there are two default DAG parameters are:
schedule="@daily"
window=-1 (or some schedule_delta object that provides a windowing impl, or
None if there is no window)

In this world, all DAGRuns would have a "schedule_date" (the date this
would have normally started at), and an "execution_date" (when it was
actually executed). If a given DAGrun has a window (one-off DAGs may or may
not!), there would be variables for "window_start" (the start of the
window) and "window_end" (the end of the window; this would default to the
same as schedule_date, and to the next DAGrun's window_start).

Disconnecting schedule date, execution date, and window/period would also
open a pathway for more advanced users to implement irregular schedules and
windows, without having to rely on hacks. For example: a DAG that runs at
"the end of the week" to produce a end-of-week would normally run on
Friday, but on Friday holidays, instead complete a day early (and scan one
fewer day in its window).

I know there's some historical baggage here, but I think the above is much
more natural to new users than what we're offering today. And I think from
a current user perspective, there would be breaking variable name changes,
but no logical differences for the majority of DAGs.

On Tue, Apr 9, 2019 at 6:17 AM Ash Berlin-Taylor <ash@apache.org> wrote:

To (slightly) hijack this thread:

On the subject of execuction_date: as I'm sure we're all aware the
concept
of execution_date is confusing to new-commers to Airflow (there are many
questions about "why hasn't my DAG run yet"? "Why is my dag a day
behind?"
etc.) and although we mention this in the docs it's a confusing concept.

What to people think about adding two new parameters: `period_start` and
`period_end` and making these the preferred terms in place of
execution_date and next_execution_date?

This hopefully avoids any ambitious terms like "execution" or "run" which
is understandably easy to conflate with the time the task is being run
(i.e. `now()`)

If people think this naming is better and less confusing I would suggest
we update all the docs and examples to use these terms (but still mention
the old names somewhere, probably in the macros docs)

Thoughts?

-ash


On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wiedmer@gmail.com>
wrote:

Hi Bas,

1) I am aware of a few places where those parameters are used in
production
in a few hundred jobs. I highly recommend we don't deprecate them
unless
we
do it in a major version.

2) As James mentioned, inlets and outlets are a lineage annotation
feature
which is still under development. Let's leave them in, but we can guard
them behind a feature flag if you prefer.

3) the yesterday*/tomorrow* params are convenience ones if you use a
daily
ETL. I agree with you that they are simple to compute, but not everyone
using Apache Airflow is amazing with Python. Some users are only trying
to
get a query scheduled and rely on a couple of niceties like these to
get
by.

4) latest_date, end_date (I feel like there used to be start_date, but
maybe it got lost) were a blend of things which were used by a backfill
framework used internally at Airbnb. Latest date was used if you needed
to
join to a dimension for which you only wanted the latest version of the
attributes in you backfill. end_date was used for time ranges where
several
days were processed together in a range to save on compute. I don't see
an
issue with removing them.

Best regards,
Arthur



On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak <
basharenslak@godatadriven.com>
wrote:

Hi all,

Following Tao Feng’s question to discuss this PR<
https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192<
https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss
here
if you agree/disagree/would change.

-----------

The summary of the PR:

I was confused by the task context values and suggest to clean up and
clarify these variables. Some are derivations from other variables,
some
are undocumented and unused, some are wrong (name doesn’t match the
value).
Please discuss what you think of the removal of these variables:


*   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
tomorrow_ds_nodash. IMO the next_* and previous_* variables are useful
since these require complex logic to compute the next execution date,
however would leave computing the yesterday* and tomorrow* variables
up
to
the user since they are simple one-liners and don't relate to the DAG
interval.
*   Removed tables. This is a field in params, and is thus also
accessible by the user ({{ params.tables }}). Also, it was
undocumented.
*   Removed latest_date. It's the same as ds and was also
undocumented.
*   Removed inlets and outlets. Also undocumented, and have the
inlets/outlets ever worked/ever been used by anybody?
*   Removed end_date and END_DATE. Both have the same value, so it
doesn't make sense to have both variables. Also, the value is ds which
contains the start date of the interval, so the naming didn't make
sense to
me. However, if anybody argues in favour of adding "start_date" and
"end_date" to provide the start and end datetime of task instance
intervals, I'd be happy to add them.

Cheers,
Bas





Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message