airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bolke de Bruin <bdbr...@gmail.com>
Subject Re: [DISCUSS] AIRFLOW-4192 - remove duplicate/obsolete/derived task context variables
Date Fri, 12 Apr 2019 08:39:52 GMT
I don’t think guarding inlets and outlets makes sense, cause if they are not defined they
code won’t be executed anyway. Ie. The logic is already there, no need for a config flag.

End-date is actually used by some beyond Airbnb as far as I know. What is the reason for deprecating
it?

Execution_date in finance is actually very well known. A transaction has an execution date
it is not the time of the sending of the data. I honestly think that a change here would make
it a thing that techies understand but does not make business sense. What is “today’s
data?” it might not be the data that was sent today.



Verstuurd vanaf mijn iPad

> Op 12 apr. 2019 om 08:35 heeft Maxime Beauchemin <maximebeauchemin@gmail.com> het
volgende geschreven:
> 
> It should be labelled as a "bridge" or "transition" release somehow, and
> tell the community to absolutely to go there first and address all
> deprecation warnings prior to upgrade to 2.0. I wonder if semver has
> something for that.
> 
> It might be the time to deprecate more things. Spring cleaning!
> 
> Should we start another thread labeled "[2.0 spring cleaning] what Airflow
> features should we deprecate!?"
> 
> Max
> 
>> On Thu, Apr 11, 2019 at 11:26 PM Ash Berlin-Taylor <ash@apache.org> wrote:
>> 
>> I am happy to do another small(!) 1.10.x release.
>> 
>> (There was a small bug I introduced where I broke the rendering of doc_md
>> on dags)
>> 
>> On 11 April 2019 22:52:13 BST, "Driesprong, Fokko" <fokko@driesprong.frl>
>> wrote:
>>> I agree with Max here, we should be careful.
>>> 
>>> Regarding the yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
>>> tomorrow_ds_nodash. I'm not against having better readable shorthands,
>>> but
>>> more about the fact that having a tomorrow_ds doesn't make sense when
>>> we
>>> have an hourly (or weekly) job.
>>> 
>>> I'm against guarding the {in,out}lets. Since the guarding will add
>>> extra
>>> logic, and this is not worth the added complexity (and another flag) in
>>> my
>>> opinion.
>>> 
>>> - We start by putting deprecation warnings on tables, latest_date,
>>> end_date
>>> and END_DATE and remove them in Airflow 2.0.
>>> 
>>> This is only possible if there is another version before 2.0.
>>> Currently,
>>> we're preparing to move to 2.0
>>> 
>>> - We add a lineage_enabled config which is false by default and thus
>>> inlets/outlets aren’t provided, unless set to true.
>>> 
>>> I'm against guarding the {in,out}lets. Since the guarding will add
>>> extra
>>> logic, and this is not worth the added complexity (and another flag) in
>>> my
>>> opinion.
>>> 
>>> - We continue discussion about clarification of execution_date and
>>> better
>>> named variables such as something_start and something_end
>>> 
>>> 
>> https://lists.apache.org/thread.html/31423aa7feba421c53356a1e566f777c7a7973966c3320611286a2fb@%3Cdev.airflow.apache.org%3E
>>> 
>>> Cheers, Fokko
>>> 
>>> Op do 11 apr. 2019 om 08:44 schreef Bas Harenslak <
>>> basharenslak@godatadriven.com>:
>>> 
>>>> Great discussion, let’s stay on track. If I can summarise:
>>>> 
>>>> 
>>>>  *   yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
>>> tomorrow_ds_nodash
>>>>     *   Arthur: some users use these for convenience
>>>>     *   Bas/Fokko: these are values that can be easily derived in a
>>>> one-liner
>>>> 
>>>>  *   tables
>>>>     *   nobody?
>>>> 
>>>>  *   latest_date
>>>>     *   Arthur: used by internal backfill framework at Airbnb, but
>>> no
>>>> issue removing them.
>>>> 
>>>>  *   inlets/outlets
>>>>     *   Arthur: still in development, could be guarded behind
>>> feature
>>>> flag.
>>>> 
>>>>  *   end_date/END_DATE
>>>>     *   Arthur: used by internal backfill framework at Airbnb, but
>>> no
>>>> issue removing them.
>>>> 
>>>> And to add to the discussion:
>>>> 
>>>> 
>>>>  *   execution_date
>>>>     *   The meaning of execution_date is ambivalent and super
>>> confusing
>>>> to everybody because it’s not the date of execution.
>>>>     *   Suggestions to add period_start/period_end, or
>>>> interval_start/interval_end, or something like that, to provide
>>> datetime
>>>> variables for start & end datetime of a DAG run.
>>>>  *   Other:
>>>>     *   Removal of variables should be done in major version and
>>>> deprecation warnings should be added.
>>>> 
>>>> So how about the following:
>>>> - We start by putting deprecation warnings on tables, latest_date,
>>>> end_date and END_DATE and remove them in Airflow 2.0.
>>>> - We add a lineage_enabled config which is false by default and thus
>>>> inlets/outlets aren’t provided, unless set to true.
>>>> - We continue discussion about yesterday* and tomorrow*
>>>> - We continue discussion about clarification of execution_date and
>>> better
>>>> named variables such as something_start and something_end
>>>> 
>>>> Cheers,
>>>> Bas
>>>> 
>>>> 
>>>> On 11 Apr 2019, at 04:52, Maxime Beauchemin
>>> <maximebeauchemin@gmail.com
>>>> <mailto:maximebeauchemin@gmail.com>> wrote:
>>>> 
>>>> Making backwards incompatible changes that require altering the
>>> thousands
>>>> (millions?!) of DAGs in the wild will alienate the community and
>>> prevent
>>>> many from orchestrating an upgrade. Upgrading hundreds of DAGs and
>>> Airflow
>>>> atomically would be hard and dangerous.
>>>> 
>>>> To mitigate this, changes to the DAG / core API should be backward
>>>> compatible, at least for a release range, and alert on upcoming
>>> deprecation
>>>> (if needed)
>>>> 
>>>> To be clear if `execution_date` is renamed to `foo`, we should have a
>>>> version range where both work just as well, but using
>>> `execution_date`
>>>> would result is warning messages about its upcoming deprecation,
>>> maybe
>>>> something like "It appears you're using the task parameter
>>> `execution_date`
>>>> which will be deprecated in favor or `foo` in upcoming release 2.0.0,
>>> find
>>>> more info at link.to/foo<http://link.to/foo>"
>>>> 
>>>> Max
>>>> 
>>>> On Tue, Apr 9, 2019 at 7:58 AM James Meickle
>>>> 
>>> <jmeickle@quantopian.com.invalid<mailto:jmeickle@quantopian.com.invalid>>
>>>> wrote:
>>>> 
>>>> I agree with Ash here. The naming of "execution_date" is incredibly
>>>> confusing to people who are new to Airflow, who think it has
>>> something to
>>>> do with... execution.
>>>> 
>>>> However, I think that there's still room for improvement with
>>>> "period_start" and "period_end". Think about manually triggered tasks
>>> -
>>>> they'd have a "period_start", but no useful "period_end" until the
>>> next
>>>> trigger occurs. And mid-day ETL tasks that are dated to "today" still
>>> have
>>>> to reference "period_end" to get the correct date, even though the
>>> DAGrun
>>>> won't be over yet!
>>>> 
>>>> If we're going to go through and rename "execution_date", I'd rather
>>> see a
>>>> wider-ranging rework that understands a mapping from a date to a
>>> window
>>>> (like "every day, process the data from one week ago", or "every
>>> Saturday,
>>>> process today's data"). Maybe something closer to how Beam does it
>>>> (but retaining simplicity for the 99% of cases that are just "run
>>> daily,
>>>> for the past day").
>>>> 
>>>> 
>>>> 
>>> 
>> https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions
>>>> 
>>>> I could see something where there are two default DAG parameters are:
>>>> schedule="@daily"
>>>> window=-1 (or some schedule_delta object that provides a windowing
>>> impl, or
>>>> None if there is no window)
>>>> 
>>>> In this world, all DAGRuns would have a "schedule_date" (the date
>>> this
>>>> would have normally started at), and an "execution_date" (when it was
>>>> actually executed). If a given DAGrun has a window (one-off DAGs may
>>> or may
>>>> not!), there would be variables for "window_start" (the start of the
>>>> window) and "window_end" (the end of the window; this would default
>>> to the
>>>> same as schedule_date, and to the next DAGrun's window_start).
>>>> 
>>>> Disconnecting schedule date, execution date, and window/period would
>>> also
>>>> open a pathway for more advanced users to implement irregular
>>> schedules and
>>>> windows, without having to rely on hacks. For example: a DAG that
>>> runs at
>>>> "the end of the week" to produce a end-of-week would normally run on
>>>> Friday, but on Friday holidays, instead complete a day early (and
>>> scan one
>>>> fewer day in its window).
>>>> 
>>>> I know there's some historical baggage here, but I think the above is
>>> much
>>>> more natural to new users than what we're offering today. And I think
>>> from
>>>> a current user perspective, there would be breaking variable name
>>> changes,
>>>> but no logical differences for the majority of DAGs.
>>>> 
>>>> On Tue, Apr 9, 2019 at 6:17 AM Ash Berlin-Taylor <ash@apache.org>
>>> wrote:
>>>> 
>>>> To (slightly) hijack this thread:
>>>> 
>>>> On the subject of execuction_date: as I'm sure we're all aware the
>>>> concept
>>>> of execution_date is confusing to new-commers to Airflow (there are
>>> many
>>>> questions about "why hasn't my DAG run yet"? "Why is my dag a day
>>>> behind?"
>>>> etc.) and although we mention this in the docs it's a confusing
>>> concept.
>>>> 
>>>> What to people think about adding two new parameters: `period_start`
>>> and
>>>> `period_end` and making these the preferred terms in place of
>>>> execution_date and next_execution_date?
>>>> 
>>>> This hopefully avoids any ambitious terms like "execution" or "run"
>>> which
>>>> is understandably easy to conflate with the time the task is being
>>> run
>>>> (i.e. `now()`)
>>>> 
>>>> If people think this naming is better and less confusing I would
>>> suggest
>>>> we update all the docs and examples to use these terms (but still
>>> mention
>>>> the old names somewhere, probably in the macros docs)
>>>> 
>>>> Thoughts?
>>>> 
>>>> -ash
>>>> 
>>>> 
>>>> On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wiedmer@gmail.com>
>>>> wrote:
>>>> 
>>>> Hi Bas,
>>>> 
>>>> 1) I am aware of a few places where those parameters are used in
>>>> production
>>>> in a few hundred jobs. I highly recommend we don't deprecate them
>>>> unless
>>>> we
>>>> do it in a major version.
>>>> 
>>>> 2) As James mentioned, inlets and outlets are a lineage annotation
>>>> feature
>>>> which is still under development. Let's leave them in, but we can
>>> guard
>>>> them behind a feature flag if you prefer.
>>>> 
>>>> 3) the yesterday*/tomorrow* params are convenience ones if you use a
>>>> daily
>>>> ETL. I agree with you that they are simple to compute, but not
>>> everyone
>>>> using Apache Airflow is amazing with Python. Some users are only
>>> trying
>>>> to
>>>> get a query scheduled and rely on a couple of niceties like these to
>>>> get
>>>> by.
>>>> 
>>>> 4) latest_date, end_date (I feel like there used to be start_date,
>>> but
>>>> maybe it got lost) were a blend of things which were used by a
>>> backfill
>>>> framework used internally at Airbnb. Latest date was used if you
>>> needed
>>>> to
>>>> join to a dimension for which you only wanted the latest version of
>>> the
>>>> attributes in you backfill. end_date was used for time ranges where
>>>> several
>>>> days were processed together in a range to save on compute. I don't
>>> see
>>>> an
>>>> issue with removing them.
>>>> 
>>>> Best regards,
>>>> Arthur
>>>> 
>>>> 
>>>> 
>>>> On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak <
>>>> basharenslak@godatadriven.com>
>>>> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> Following Tao Feng’s question to discuss this PR<
>>>> https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192<
>>>> https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss
>>>> here
>>>> if you agree/disagree/would change.
>>>> 
>>>> -----------
>>>> 
>>>> The summary of the PR:
>>>> 
>>>> I was confused by the task context values and suggest to clean up and
>>>> clarify these variables. Some are derivations from other variables,
>>>> some
>>>> are undocumented and unused, some are wrong (name doesn’t match the
>>>> value).
>>>> Please discuss what you think of the removal of these variables:
>>>> 
>>>> 
>>>> *   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
>>>> tomorrow_ds_nodash. IMO the next_* and previous_* variables are
>>> useful
>>>> since these require complex logic to compute the next execution date,
>>>> however would leave computing the yesterday* and tomorrow* variables
>>>> up
>>>> to
>>>> the user since they are simple one-liners and don't relate to the DAG
>>>> interval.
>>>> *   Removed tables. This is a field in params, and is thus also
>>>> accessible by the user ({{ params.tables }}). Also, it was
>>>> undocumented.
>>>> *   Removed latest_date. It's the same as ds and was also
>>>> undocumented.
>>>> *   Removed inlets and outlets. Also undocumented, and have the
>>>> inlets/outlets ever worked/ever been used by anybody?
>>>> *   Removed end_date and END_DATE. Both have the same value, so it
>>>> doesn't make sense to have both variables. Also, the value is ds
>>> which
>>>> contains the start date of the interval, so the naming didn't make
>>>> sense to
>>>> me. However, if anybody argues in favour of adding "start_date" and
>>>> "end_date" to provide the start and end datetime of task instance
>>>> intervals, I'd be happy to add them.
>>>> 
>>>> Cheers,
>>>> Bas
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>> 

Mime
View raw message