airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maxime Beauchemin <maximebeauche...@gmail.com>
Subject Re: [DISCUSS] AIRFLOW-4192 - remove duplicate/obsolete/derived task context variables
Date Fri, 12 Apr 2019 06:35:57 GMT
It should be labelled as a "bridge" or "transition" release somehow, and
tell the community to absolutely to go there first and address all
deprecation warnings prior to upgrade to 2.0. I wonder if semver has
something for that.

It might be the time to deprecate more things. Spring cleaning!

Should we start another thread labeled "[2.0 spring cleaning] what Airflow
features should we deprecate!?"

Max

On Thu, Apr 11, 2019 at 11:26 PM Ash Berlin-Taylor <ash@apache.org> wrote:

> I am happy to do another small(!) 1.10.x release.
>
> (There was a small bug I introduced where I broke the rendering of doc_md
> on dags)
>
> On 11 April 2019 22:52:13 BST, "Driesprong, Fokko" <fokko@driesprong.frl>
> wrote:
> >I agree with Max here, we should be careful.
> >
> >Regarding the yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> >tomorrow_ds_nodash. I'm not against having better readable shorthands,
> >but
> >more about the fact that having a tomorrow_ds doesn't make sense when
> >we
> >have an hourly (or weekly) job.
> >
> >I'm against guarding the {in,out}lets. Since the guarding will add
> >extra
> >logic, and this is not worth the added complexity (and another flag) in
> >my
> >opinion.
> >
> >- We start by putting deprecation warnings on tables, latest_date,
> >end_date
> >and END_DATE and remove them in Airflow 2.0.
> >
> >This is only possible if there is another version before 2.0.
> >Currently,
> >we're preparing to move to 2.0
> >
> >- We add a lineage_enabled config which is false by default and thus
> >inlets/outlets aren’t provided, unless set to true.
> >
> >I'm against guarding the {in,out}lets. Since the guarding will add
> >extra
> >logic, and this is not worth the added complexity (and another flag) in
> >my
> >opinion.
> >
> >- We continue discussion about clarification of execution_date and
> >better
> >named variables such as something_start and something_end
> >
> >
> https://lists.apache.org/thread.html/31423aa7feba421c53356a1e566f777c7a7973966c3320611286a2fb@%3Cdev.airflow.apache.org%3E
> >
> >Cheers, Fokko
> >
> >Op do 11 apr. 2019 om 08:44 schreef Bas Harenslak <
> >basharenslak@godatadriven.com>:
> >
> >> Great discussion, let’s stay on track. If I can summarise:
> >>
> >>
> >>   *   yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> >tomorrow_ds_nodash
> >>      *   Arthur: some users use these for convenience
> >>      *   Bas/Fokko: these are values that can be easily derived in a
> >> one-liner
> >>
> >>   *   tables
> >>      *   nobody?
> >>
> >>   *   latest_date
> >>      *   Arthur: used by internal backfill framework at Airbnb, but
> >no
> >> issue removing them.
> >>
> >>   *   inlets/outlets
> >>      *   Arthur: still in development, could be guarded behind
> >feature
> >> flag.
> >>
> >>   *   end_date/END_DATE
> >>      *   Arthur: used by internal backfill framework at Airbnb, but
> >no
> >> issue removing them.
> >>
> >> And to add to the discussion:
> >>
> >>
> >>   *   execution_date
> >>      *   The meaning of execution_date is ambivalent and super
> >confusing
> >> to everybody because it’s not the date of execution.
> >>      *   Suggestions to add period_start/period_end, or
> >> interval_start/interval_end, or something like that, to provide
> >datetime
> >> variables for start & end datetime of a DAG run.
> >>   *   Other:
> >>      *   Removal of variables should be done in major version and
> >> deprecation warnings should be added.
> >>
> >> So how about the following:
> >> - We start by putting deprecation warnings on tables, latest_date,
> >> end_date and END_DATE and remove them in Airflow 2.0.
> >> - We add a lineage_enabled config which is false by default and thus
> >> inlets/outlets aren’t provided, unless set to true.
> >> - We continue discussion about yesterday* and tomorrow*
> >> - We continue discussion about clarification of execution_date and
> >better
> >> named variables such as something_start and something_end
> >>
> >> Cheers,
> >> Bas
> >>
> >>
> >> On 11 Apr 2019, at 04:52, Maxime Beauchemin
> ><maximebeauchemin@gmail.com
> >> <mailto:maximebeauchemin@gmail.com>> wrote:
> >>
> >> Making backwards incompatible changes that require altering the
> >thousands
> >> (millions?!) of DAGs in the wild will alienate the community and
> >prevent
> >> many from orchestrating an upgrade. Upgrading hundreds of DAGs and
> >Airflow
> >> atomically would be hard and dangerous.
> >>
> >> To mitigate this, changes to the DAG / core API should be backward
> >> compatible, at least for a release range, and alert on upcoming
> >deprecation
> >> (if needed)
> >>
> >> To be clear if `execution_date` is renamed to `foo`, we should have a
> >> version range where both work just as well, but using
> >`execution_date`
> >> would result is warning messages about its upcoming deprecation,
> >maybe
> >> something like "It appears you're using the task parameter
> >`execution_date`
> >> which will be deprecated in favor or `foo` in upcoming release 2.0.0,
> >find
> >> more info at link.to/foo<http://link.to/foo>"
> >>
> >> Max
> >>
> >> On Tue, Apr 9, 2019 at 7:58 AM James Meickle
> >>
> ><jmeickle@quantopian.com.invalid<mailto:jmeickle@quantopian.com.invalid>>
> >> wrote:
> >>
> >> I agree with Ash here. The naming of "execution_date" is incredibly
> >> confusing to people who are new to Airflow, who think it has
> >something to
> >> do with... execution.
> >>
> >> However, I think that there's still room for improvement with
> >> "period_start" and "period_end". Think about manually triggered tasks
> >-
> >> they'd have a "period_start", but no useful "period_end" until the
> >next
> >> trigger occurs. And mid-day ETL tasks that are dated to "today" still
> >have
> >> to reference "period_end" to get the correct date, even though the
> >DAGrun
> >> won't be over yet!
> >>
> >> If we're going to go through and rename "execution_date", I'd rather
> >see a
> >> wider-ranging rework that understands a mapping from a date to a
> >window
> >> (like "every day, process the data from one week ago", or "every
> >Saturday,
> >> process today's data"). Maybe something closer to how Beam does it
> >> (but retaining simplicity for the 99% of cases that are just "run
> >daily,
> >> for the past day").
> >>
> >>
> >>
> >
> https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions
> >>
> >> I could see something where there are two default DAG parameters are:
> >> schedule="@daily"
> >> window=-1 (or some schedule_delta object that provides a windowing
> >impl, or
> >> None if there is no window)
> >>
> >> In this world, all DAGRuns would have a "schedule_date" (the date
> >this
> >> would have normally started at), and an "execution_date" (when it was
> >> actually executed). If a given DAGrun has a window (one-off DAGs may
> >or may
> >> not!), there would be variables for "window_start" (the start of the
> >> window) and "window_end" (the end of the window; this would default
> >to the
> >> same as schedule_date, and to the next DAGrun's window_start).
> >>
> >> Disconnecting schedule date, execution date, and window/period would
> >also
> >> open a pathway for more advanced users to implement irregular
> >schedules and
> >> windows, without having to rely on hacks. For example: a DAG that
> >runs at
> >> "the end of the week" to produce a end-of-week would normally run on
> >> Friday, but on Friday holidays, instead complete a day early (and
> >scan one
> >> fewer day in its window).
> >>
> >> I know there's some historical baggage here, but I think the above is
> >much
> >> more natural to new users than what we're offering today. And I think
> >from
> >> a current user perspective, there would be breaking variable name
> >changes,
> >> but no logical differences for the majority of DAGs.
> >>
> >> On Tue, Apr 9, 2019 at 6:17 AM Ash Berlin-Taylor <ash@apache.org>
> >wrote:
> >>
> >> To (slightly) hijack this thread:
> >>
> >> On the subject of execuction_date: as I'm sure we're all aware the
> >> concept
> >> of execution_date is confusing to new-commers to Airflow (there are
> >many
> >> questions about "why hasn't my DAG run yet"? "Why is my dag a day
> >> behind?"
> >> etc.) and although we mention this in the docs it's a confusing
> >concept.
> >>
> >> What to people think about adding two new parameters: `period_start`
> >and
> >> `period_end` and making these the preferred terms in place of
> >> execution_date and next_execution_date?
> >>
> >> This hopefully avoids any ambitious terms like "execution" or "run"
> >which
> >> is understandably easy to conflate with the time the task is being
> >run
> >> (i.e. `now()`)
> >>
> >> If people think this naming is better and less confusing I would
> >suggest
> >> we update all the docs and examples to use these terms (but still
> >mention
> >> the old names somewhere, probably in the macros docs)
> >>
> >> Thoughts?
> >>
> >> -ash
> >>
> >>
> >> On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wiedmer@gmail.com>
> >> wrote:
> >>
> >> Hi Bas,
> >>
> >> 1) I am aware of a few places where those parameters are used in
> >> production
> >> in a few hundred jobs. I highly recommend we don't deprecate them
> >> unless
> >> we
> >> do it in a major version.
> >>
> >> 2) As James mentioned, inlets and outlets are a lineage annotation
> >> feature
> >> which is still under development. Let's leave them in, but we can
> >guard
> >> them behind a feature flag if you prefer.
> >>
> >> 3) the yesterday*/tomorrow* params are convenience ones if you use a
> >> daily
> >> ETL. I agree with you that they are simple to compute, but not
> >everyone
> >> using Apache Airflow is amazing with Python. Some users are only
> >trying
> >> to
> >> get a query scheduled and rely on a couple of niceties like these to
> >> get
> >> by.
> >>
> >> 4) latest_date, end_date (I feel like there used to be start_date,
> >but
> >> maybe it got lost) were a blend of things which were used by a
> >backfill
> >> framework used internally at Airbnb. Latest date was used if you
> >needed
> >> to
> >> join to a dimension for which you only wanted the latest version of
> >the
> >> attributes in you backfill. end_date was used for time ranges where
> >> several
> >> days were processed together in a range to save on compute. I don't
> >see
> >> an
> >> issue with removing them.
> >>
> >> Best regards,
> >> Arthur
> >>
> >>
> >>
> >> On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak <
> >> basharenslak@godatadriven.com>
> >> wrote:
> >>
> >> Hi all,
> >>
> >> Following Tao Feng’s question to discuss this PR<
> >> https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192<
> >> https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss
> >> here
> >> if you agree/disagree/would change.
> >>
> >> -----------
> >>
> >> The summary of the PR:
> >>
> >> I was confused by the task context values and suggest to clean up and
> >> clarify these variables. Some are derivations from other variables,
> >> some
> >> are undocumented and unused, some are wrong (name doesn’t match the
> >> value).
> >> Please discuss what you think of the removal of these variables:
> >>
> >>
> >> *   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> >> tomorrow_ds_nodash. IMO the next_* and previous_* variables are
> >useful
> >> since these require complex logic to compute the next execution date,
> >> however would leave computing the yesterday* and tomorrow* variables
> >> up
> >> to
> >> the user since they are simple one-liners and don't relate to the DAG
> >> interval.
> >> *   Removed tables. This is a field in params, and is thus also
> >> accessible by the user ({{ params.tables }}). Also, it was
> >> undocumented.
> >> *   Removed latest_date. It's the same as ds and was also
> >> undocumented.
> >> *   Removed inlets and outlets. Also undocumented, and have the
> >> inlets/outlets ever worked/ever been used by anybody?
> >> *   Removed end_date and END_DATE. Both have the same value, so it
> >> doesn't make sense to have both variables. Also, the value is ds
> >which
> >> contains the start date of the interval, so the naming didn't make
> >> sense to
> >> me. However, if anybody argues in favour of adding "start_date" and
> >> "end_date" to provide the start and end datetime of task instance
> >> intervals, I'd be happy to add them.
> >>
> >> Cheers,
> >> Bas
> >>
> >>
> >>
> >>
> >>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message