airflow-dev mailing list archives

From Arthur Wiedmer <arthur.wied...@gmail.com>
Subject Re: [DISCUSS] AIRFLOW-4192 - remove duplicate/obsolete/derived task context variables
Date Mon, 08 Apr 2019 15:20:26 GMT
Hi Bas,

1) I am aware of a few places where those parameters are used in production,
across a few hundred jobs. I highly recommend we don't deprecate them unless
we do it in a major version.

2) As James mentioned, inlets and outlets are a lineage annotation feature
which is still under development. Let's leave them in, but we can guard
them behind a feature flag if you prefer.

3) The yesterday*/tomorrow* params are convenience ones if you use a daily
ETL. I agree with you that they are simple to compute, but not everyone
using Apache Airflow is amazing with Python. Some users are only trying to
get a query scheduled and rely on a couple of niceties like these to get by.
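
For reference, this is roughly what a user would have to write themselves if
these variables went away. It is only a minimal sketch: shift_ds is a
hypothetical helper name (not part of Airflow), and the ds value is an example
date. Inside a template, the existing macros.ds_add should give the same
result, e.g. {{ macros.ds_add(ds, -1) }}.

```python
from datetime import datetime, timedelta


def shift_ds(ds: str, days: int) -> str:
    """Shift a YYYY-MM-DD date string by a number of days (hypothetical helper)."""
    shifted = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=days)
    return shifted.strftime("%Y-%m-%d")


ds = "2019-04-08"  # example execution date, as it would appear in the task context

# Equivalents of the convenience variables under discussion
yesterday_ds = shift_ds(ds, -1)
tomorrow_ds = shift_ds(ds, 1)
yesterday_ds_nodash = yesterday_ds.replace("-", "")
tomorrow_ds_nodash = tomorrow_ds.replace("-", "")
```

Simple enough for a Python developer, but it is still more than someone who
only wants to schedule a query should need to know.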

4) latest_date and end_date (I feel like there used to be a start_date, but
maybe it got lost) were a blend of things used by a backfill framework used
internally at Airbnb. latest_date was used if you needed to join to a
dimension for which you only wanted the latest version of the attributes in
your backfill. end_date was used for time ranges where several days were
processed together in a range to save on compute. I don't see an issue with
removing them.

Best regards,
Arthur



On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak <basharenslak@godatadriven.com>
wrote:

> Hi all,
>
> Following Tao Feng’s question to discuss this PR
> (https://github.com/apache/airflow/pull/5010) for AIRFLOW-4192
> (https://issues.apache.org/jira/browse/AIRFLOW-4192), please discuss here
> if you agree/disagree/would change.
>
> -----------
>
> The summary of the PR:
>
> I was confused by the task context values and suggest cleaning up and
> clarify these variables. Some are derivations from other variables, some
> are undocumented and unused, some are wrong (name doesn’t match the value).
> Please discuss what you think of the removal of these variables:
>
>
>   *   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> tomorrow_ds_nodash. IMO the next_* and previous_* variables are useful
> since these require complex logic to compute the next execution date,
> however would leave computing the yesterday* and tomorrow* variables up to
> the user since they are simple one-liners and don't relate to the DAG
> interval.
>   *   Removed tables. This is a field in params, and is thus also
> accessible by the user ({{ params.tables }}). Also, it was undocumented.
>   *   Removed latest_date. It's the same as ds and was also undocumented.
>   *   Removed inlets and outlets. Also undocumented, and have the
> inlets/outlets ever worked/ever been used by anybody?
>   *   Removed end_date and END_DATE. Both have the same value, so it
> doesn't make sense to have both variables. Also, the value is ds which
> contains the start date of the interval, so the naming didn't make sense to
> me. However, if anybody argues in favour of adding "start_date" and
> "end_date" to provide the start and end datetime of task instance
> intervals, I'd be happy to add them.
>
> Cheers,
> Bas
>
