airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ash Berlin-Taylor <...@apache.org>
Subject Re: [DISCUSS] AIRFLOW-4192 - remove duplicate/obsolete/derived task context variables
Date Tue, 30 Apr 2019 17:23:40 GMT
Did we come to any conclusion on this topic?


> On 12 Apr 2019, at 10:26, Ash Berlin-Taylor <ash@apache.org> wrote:
> 
> Does anyone actually use end_date (of either spelling) given it's value is currently
the same as `ds`: https://github.com/apache/airflow/blob/3020d9733cb189f091489a62f58f0f586dc8d4a9/airflow/models/taskinstance.py#L1183-L1184

> 
> 
>    'END_DATE': ds,
>    'end_date': ds,
> 
> That seems like it is just the wrong value - I could see it being useful as either the
same value as `next_execution_date` or `dag.end_date` but the current value (from a quick
thought) doesn't seem like it is useful?
> 
> Given what the value is currently set to I vote for deprecating/removing these two.
> 
>> On 12 Apr 2019, at 09:39, Bolke de Bruin <bdbruin@gmail.com> wrote:
>> 
>> I don’t think guarding inlets and outlets makes sense, cause if they are not defined
they code won’t be executed anyway. Ie. The logic is already there, no need for a config
flag.
>> 
>> End-date is actually used by some beyond Airbnb as far as I know. What is the reason
for deprecating it?
>> 
>> Execution_date in finance is actually very well known. A transaction has an execution
date it is not the time of the sending of the data. I honestly think that a change here would
make it a thing that techies understand but does not make business sense. What is “today’s
data?” it might not be the data that was sent today.
>> 
>> 
>> 
>> Verstuurd vanaf mijn iPad
>> 
>>> Op 12 apr. 2019 om 08:35 heeft Maxime Beauchemin <maximebeauchemin@gmail.com>
het volgende geschreven:
>>> 
>>> It should be labelled as a "bridge" or "transition" release somehow, and
>>> tell the community to absolutely to go there first and address all
>>> deprecation warnings prior to upgrade to 2.0. I wonder if semver has
>>> something for that.
>>> 
>>> It might be the time to deprecate more things. Spring cleaning!
>>> 
>>> Should we start another thread labeled "[2.0 spring cleaning] what Airflow
>>> features should we deprecate!?"
>>> 
>>> Max
>>> 
>>>> On Thu, Apr 11, 2019 at 11:26 PM Ash Berlin-Taylor <ash@apache.org>
wrote:
>>>> 
>>>> I am happy to do another small(!) 1.10.x release.
>>>> 
>>>> (There was a small bug I introduced where I broke the rendering of doc_md
>>>> on dags)
>>>> 
>>>> On 11 April 2019 22:52:13 BST, "Driesprong, Fokko" <fokko@driesprong.frl>
>>>> wrote:
>>>>> I agree with Max here, we should be careful.
>>>>> 
>>>>> Regarding the yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
>>>>> tomorrow_ds_nodash. I'm not against having better readable shorthands,
>>>>> but
>>>>> more about the fact that having a tomorrow_ds doesn't make sense when
>>>>> we
>>>>> have an hourly (or weekly) job.
>>>>> 
>>>>> I'm against guarding the {in,out}lets. Since the guarding will add
>>>>> extra
>>>>> logic, and this is not worth the added complexity (and another flag)
in
>>>>> my
>>>>> opinion.
>>>>> 
>>>>> - We start by putting deprecation warnings on tables, latest_date,
>>>>> end_date
>>>>> and END_DATE and remove them in Airflow 2.0.
>>>>> 
>>>>> This is only possible if there is another version before 2.0.
>>>>> Currently,
>>>>> we're preparing to move to 2.0
>>>>> 
>>>>> - We add a lineage_enabled config which is false by default and thus
>>>>> inlets/outlets aren’t provided, unless set to true.
>>>>> 
>>>>> I'm against guarding the {in,out}lets. Since the guarding will add
>>>>> extra
>>>>> logic, and this is not worth the added complexity (and another flag)
in
>>>>> my
>>>>> opinion.
>>>>> 
>>>>> - We continue discussion about clarification of execution_date and
>>>>> better
>>>>> named variables such as something_start and something_end
>>>>> 
>>>>> 
>>>> https://lists.apache.org/thread.html/31423aa7feba421c53356a1e566f777c7a7973966c3320611286a2fb@%3Cdev.airflow.apache.org%3E
>>>>> 
>>>>> Cheers, Fokko
>>>>> 
>>>>> Op do 11 apr. 2019 om 08:44 schreef Bas Harenslak <
>>>>> basharenslak@godatadriven.com>:
>>>>> 
>>>>>> Great discussion, let’s stay on track. If I can summarise:
>>>>>> 
>>>>>> 
>>>>>> *   yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
>>>>> tomorrow_ds_nodash
>>>>>>   *   Arthur: some users use these for convenience
>>>>>>   *   Bas/Fokko: these are values that can be easily derived in a
>>>>>> one-liner
>>>>>> 
>>>>>> *   tables
>>>>>>   *   nobody?
>>>>>> 
>>>>>> *   latest_date
>>>>>>   *   Arthur: used by internal backfill framework at Airbnb, but
>>>>> no
>>>>>> issue removing them.
>>>>>> 
>>>>>> *   inlets/outlets
>>>>>>   *   Arthur: still in development, could be guarded behind
>>>>> feature
>>>>>> flag.
>>>>>> 
>>>>>> *   end_date/END_DATE
>>>>>>   *   Arthur: used by internal backfill framework at Airbnb, but
>>>>> no
>>>>>> issue removing them.
>>>>>> 
>>>>>> And to add to the discussion:
>>>>>> 
>>>>>> 
>>>>>> *   execution_date
>>>>>>   *   The meaning of execution_date is ambivalent and super
>>>>> confusing
>>>>>> to everybody because it’s not the date of execution.
>>>>>>   *   Suggestions to add period_start/period_end, or
>>>>>> interval_start/interval_end, or something like that, to provide
>>>>> datetime
>>>>>> variables for start & end datetime of a DAG run.
>>>>>> *   Other:
>>>>>>   *   Removal of variables should be done in major version and
>>>>>> deprecation warnings should be added.
>>>>>> 
>>>>>> So how about the following:
>>>>>> - We start by putting deprecation warnings on tables, latest_date,
>>>>>> end_date and END_DATE and remove them in Airflow 2.0.
>>>>>> - We add a lineage_enabled config which is false by default and thus
>>>>>> inlets/outlets aren’t provided, unless set to true.
>>>>>> - We continue discussion about yesterday* and tomorrow*
>>>>>> - We continue discussion about clarification of execution_date and
>>>>> better
>>>>>> named variables such as something_start and something_end
>>>>>> 
>>>>>> Cheers,
>>>>>> Bas
>>>>>> 
>>>>>> 
>>>>>> On 11 Apr 2019, at 04:52, Maxime Beauchemin
>>>>> <maximebeauchemin@gmail.com
>>>>>> <mailto:maximebeauchemin@gmail.com>> wrote:
>>>>>> 
>>>>>> Making backwards incompatible changes that require altering the
>>>>> thousands
>>>>>> (millions?!) of DAGs in the wild will alienate the community and
>>>>> prevent
>>>>>> many from orchestrating an upgrade. Upgrading hundreds of DAGs and
>>>>> Airflow
>>>>>> atomically would be hard and dangerous.
>>>>>> 
>>>>>> To mitigate this, changes to the DAG / core API should be backward
>>>>>> compatible, at least for a release range, and alert on upcoming
>>>>> deprecation
>>>>>> (if needed)
>>>>>> 
>>>>>> To be clear if `execution_date` is renamed to `foo`, we should have
a
>>>>>> version range where both work just as well, but using
>>>>> `execution_date`
>>>>>> would result is warning messages about its upcoming deprecation,
>>>>> maybe
>>>>>> something like "It appears you're using the task parameter
>>>>> `execution_date`
>>>>>> which will be deprecated in favor or `foo` in upcoming release 2.0.0,
>>>>> find
>>>>>> more info at link.to/foo<http://link.to/foo>"
>>>>>> 
>>>>>> Max
>>>>>> 
>>>>>> On Tue, Apr 9, 2019 at 7:58 AM James Meickle
>>>>>> 
>>>>> <jmeickle@quantopian.com.invalid<mailto:jmeickle@quantopian.com.invalid>>
>>>>>> wrote:
>>>>>> 
>>>>>> I agree with Ash here. The naming of "execution_date" is incredibly
>>>>>> confusing to people who are new to Airflow, who think it has
>>>>> something to
>>>>>> do with... execution.
>>>>>> 
>>>>>> However, I think that there's still room for improvement with
>>>>>> "period_start" and "period_end". Think about manually triggered tasks
>>>>> -
>>>>>> they'd have a "period_start", but no useful "period_end" until the
>>>>> next
>>>>>> trigger occurs. And mid-day ETL tasks that are dated to "today" still
>>>>> have
>>>>>> to reference "period_end" to get the correct date, even though the
>>>>> DAGrun
>>>>>> won't be over yet!
>>>>>> 
>>>>>> If we're going to go through and rename "execution_date", I'd rather
>>>>> see a
>>>>>> wider-ranging rework that understands a mapping from a date to a
>>>>> window
>>>>>> (like "every day, process the data from one week ago", or "every
>>>>> Saturday,
>>>>>> process today's data"). Maybe something closer to how Beam does it
>>>>>> (but retaining simplicity for the 99% of cases that are just "run
>>>>> daily,
>>>>>> for the past day").
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions
>>>>>> 
>>>>>> I could see something where there are two default DAG parameters
are:
>>>>>> schedule="@daily"
>>>>>> window=-1 (or some schedule_delta object that provides a windowing
>>>>> impl, or
>>>>>> None if there is no window)
>>>>>> 
>>>>>> In this world, all DAGRuns would have a "schedule_date" (the date
>>>>> this
>>>>>> would have normally started at), and an "execution_date" (when it
was
>>>>>> actually executed). If a given DAGrun has a window (one-off DAGs
may
>>>>> or may
>>>>>> not!), there would be variables for "window_start" (the start of
the
>>>>>> window) and "window_end" (the end of the window; this would default
>>>>> to the
>>>>>> same as schedule_date, and to the next DAGrun's window_start).
>>>>>> 
>>>>>> Disconnecting schedule date, execution date, and window/period would
>>>>> also
>>>>>> open a pathway for more advanced users to implement irregular
>>>>> schedules and
>>>>>> windows, without having to rely on hacks. For example: a DAG that
>>>>> runs at
>>>>>> "the end of the week" to produce a end-of-week would normally run
on
>>>>>> Friday, but on Friday holidays, instead complete a day early (and
>>>>> scan one
>>>>>> fewer day in its window).
>>>>>> 
>>>>>> I know there's some historical baggage here, but I think the above
is
>>>>> much
>>>>>> more natural to new users than what we're offering today. And I think
>>>>> from
>>>>>> a current user perspective, there would be breaking variable name
>>>>> changes,
>>>>>> but no logical differences for the majority of DAGs.
>>>>>> 
>>>>>> On Tue, Apr 9, 2019 at 6:17 AM Ash Berlin-Taylor <ash@apache.org>
>>>>> wrote:
>>>>>> 
>>>>>> To (slightly) hijack this thread:
>>>>>> 
>>>>>> On the subject of execuction_date: as I'm sure we're all aware the
>>>>>> concept
>>>>>> of execution_date is confusing to new-commers to Airflow (there are
>>>>> many
>>>>>> questions about "why hasn't my DAG run yet"? "Why is my dag a day
>>>>>> behind?"
>>>>>> etc.) and although we mention this in the docs it's a confusing
>>>>> concept.
>>>>>> 
>>>>>> What to people think about adding two new parameters: `period_start`
>>>>> and
>>>>>> `period_end` and making these the preferred terms in place of
>>>>>> execution_date and next_execution_date?
>>>>>> 
>>>>>> This hopefully avoids any ambitious terms like "execution" or "run"
>>>>> which
>>>>>> is understandably easy to conflate with the time the task is being
>>>>> run
>>>>>> (i.e. `now()`)
>>>>>> 
>>>>>> If people think this naming is better and less confusing I would
>>>>> suggest
>>>>>> we update all the docs and examples to use these terms (but still
>>>>> mention
>>>>>> the old names somewhere, probably in the macros docs)
>>>>>> 
>>>>>> Thoughts?
>>>>>> 
>>>>>> -ash
>>>>>> 
>>>>>> 
>>>>>> On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wiedmer@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Hi Bas,
>>>>>> 
>>>>>> 1) I am aware of a few places where those parameters are used in
>>>>>> production
>>>>>> in a few hundred jobs. I highly recommend we don't deprecate them
>>>>>> unless
>>>>>> we
>>>>>> do it in a major version.
>>>>>> 
>>>>>> 2) As James mentioned, inlets and outlets are a lineage annotation
>>>>>> feature
>>>>>> which is still under development. Let's leave them in, but we can
>>>>> guard
>>>>>> them behind a feature flag if you prefer.
>>>>>> 
>>>>>> 3) the yesterday*/tomorrow* params are convenience ones if you use
a
>>>>>> daily
>>>>>> ETL. I agree with you that they are simple to compute, but not
>>>>> everyone
>>>>>> using Apache Airflow is amazing with Python. Some users are only
>>>>> trying
>>>>>> to
>>>>>> get a query scheduled and rely on a couple of niceties like these
to
>>>>>> get
>>>>>> by.
>>>>>> 
>>>>>> 4) latest_date, end_date (I feel like there used to be start_date,
>>>>> but
>>>>>> maybe it got lost) were a blend of things which were used by a
>>>>> backfill
>>>>>> framework used internally at Airbnb. Latest date was used if you
>>>>> needed
>>>>>> to
>>>>>> join to a dimension for which you only wanted the latest version
of
>>>>> the
>>>>>> attributes in you backfill. end_date was used for time ranges where
>>>>>> several
>>>>>> days were processed together in a range to save on compute. I don't
>>>>> see
>>>>>> an
>>>>>> issue with removing them.
>>>>>> 
>>>>>> Best regards,
>>>>>> Arthur
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak <
>>>>>> basharenslak@godatadriven.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> Following Tao Feng’s question to discuss this PR<
>>>>>> https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192<
>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss
>>>>>> here
>>>>>> if you agree/disagree/would change.
>>>>>> 
>>>>>> -----------
>>>>>> 
>>>>>> The summary of the PR:
>>>>>> 
>>>>>> I was confused by the task context values and suggest to clean up
and
>>>>>> clarify these variables. Some are derivations from other variables,
>>>>>> some
>>>>>> are undocumented and unused, some are wrong (name doesn’t match
the
>>>>>> value).
>>>>>> Please discuss what you think of the removal of these variables:
>>>>>> 
>>>>>> 
>>>>>> *   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
>>>>>> tomorrow_ds_nodash. IMO the next_* and previous_* variables are
>>>>> useful
>>>>>> since these require complex logic to compute the next execution date,
>>>>>> however would leave computing the yesterday* and tomorrow* variables
>>>>>> up
>>>>>> to
>>>>>> the user since they are simple one-liners and don't relate to the
DAG
>>>>>> interval.
>>>>>> *   Removed tables. This is a field in params, and is thus also
>>>>>> accessible by the user ({{ params.tables }}). Also, it was
>>>>>> undocumented.
>>>>>> *   Removed latest_date. It's the same as ds and was also
>>>>>> undocumented.
>>>>>> *   Removed inlets and outlets. Also undocumented, and have the
>>>>>> inlets/outlets ever worked/ever been used by anybody?
>>>>>> *   Removed end_date and END_DATE. Both have the same value, so it
>>>>>> doesn't make sense to have both variables. Also, the value is ds
>>>>> which
>>>>>> contains the start date of the interval, so the naming didn't make
>>>>>> sense to
>>>>>> me. However, if anybody argues in favour of adding "start_date" and
>>>>>> "end_date" to provide the start and end datetime of task instance
>>>>>> intervals, I'd be happy to add them.
>>>>>> 
>>>>>> Cheers,
>>>>>> Bas
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
> 


Mime
View raw message