airflow-dev mailing list archives

From "Driesprong, Fokko" <fo...@driesprong.frl>
Subject Re: [DISCUSS] period_start/period_end instead of execution_date/next_execution_date
Date Wed, 10 Apr 2019 13:30:33 GMT
I see what you mean. I don't really like the `period_{start,end}` name, but
something such as `interval_{start,end}` might do it for me.

Personally, I think running the job after the interval closes (since then
you have all the data over the interval) makes complete sense for ETL
jobs. I agree it requires some time to get used to. Maybe our
documentation is lacking here.
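
To make the interval idea concrete, here is a minimal sketch in plain
Python. It is not Airflow code: `interval_bounds` is a hypothetical helper,
and `interval_start`/`interval_end` are only the names floated in this
thread. It assumes the execution_date marks the start of the interval and
that the run starts once the interval has closed.

```python
from datetime import datetime, timedelta, timezone


def interval_bounds(execution_date, schedule_interval):
    """Return (interval_start, interval_end) for the run stamped execution_date."""
    return execution_date, execution_date + schedule_interval


# Daily schedule: the run stamped 2019-04-09 covers that whole day.
execution_date = datetime(2019, 4, 9, tzinfo=timezone.utc)
start, end = interval_bounds(execution_date, timedelta(days=1))

# The run can start no earlier than `end`, once the interval has closed
# and all of its data exists.
earliest_start = end

# Switching to an hourly schedule changes only the interval length.
_, hourly_end = interval_bounds(execution_date, timedelta(hours=1))

print(start.isoformat(), "->", end.isoformat())
```

Seen this way, the date attached to a run names the data it covers, not
the moment it executes.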

Cheers, Fokko

On Wed, 10 Apr 2019 at 10:08, Flo Rance <trourance@gmail.com> wrote:

> I didn't expect to participate in any debate about this software, as I'm a
> complete newcomer. But I'm almost forced to, as I am the target audience too.
>
> To answer your initial question, after reading a lot of documentation I
> find the term execution_date really counterintuitive, so yes, maybe
> period_start and period_end would be better names to help understand
> how all the initial scheduling works. Because even after reading the
> scheduling section of the docs and the FAQ, it was still not clear in my
> mind. Btw, I find some ideas put forward by James Meickle in the [DISCUSS]
> AIRFLOW-4192 thread very interesting and I share his opinion that there's
> still room for improvement.
> But a mode to change from "run at end of period, I need all the data
> available for this period" (the current) to "run at _this_ time on the
> schedule_interval" would be awesome.
>
> Regards,
> Flo
>
> On Tue, Apr 9, 2019 at 4:41 PM Ash Berlin-Taylor <ash@apache.org> wrote:
>
> > Yeah, that's the other thing that has been talked about from time to time,
> > which is a mode to change from "run at end of period, I need all the data
> > available for this period" (the current) to "run at _this_ time on the
> > schedule_interval, don't wait for the period to end".
> >
> > (No such flag exists right now, before you go looking.)
> >
> > > On 9 Apr 2019, at 15:31, Shaw, Damian P. <damian.shaw.2@credit-suisse.com> wrote:
> > >
> > > Hi all,
> > >
> > > I'm new to this Airflow Dev mailing list so I wasn't expecting to reply
> > > to anything, but I feel I am the target audience for this question. I am
> > > quite new to Airflow and have been setting up an Airflow environment for
> > > my business this last month.
> > >
> > > I find the current "execution_date" a small technical burden and a
> > > large cognitive burden. Our workflow is based on DAGs running at a
> > > specified time in a specified timezone using the same date as the
> > > current calendar date.
> > >
> > > I have worked around this by creating my own macro and context
> > > variables, with the logic looking like this:
> > >        airflow_execution_date = context['execution_date']
> > >        dag_timezone = context['dag'].timezone
> > >        local_execution_date = dag_timezone.convert(airflow_execution_date)
> > >        local_cal_date = local_execution_date + datetime.timedelta(days=1)
> > >
> > > As you can see this isn't a lot of technical effort, but having a date
> > > that 1) is in the timezone the business users are working in, and 2) is
> > > the same calendar date the business users are working in significantly
> > > reduces the cognitive effort required to set up tasks. Of course this
> > > doesn't help with cron-format scheduling: I just let the business give
> > > me the requirements and set it up myself, as the date logic there is
> > > still confusing because it doesn't work like real cron scheduling, which
> > > everyone is familiar with.
> > >
> > > Maybe "period_start" and "period_end" might help people on Day 0 of
> > > understanding Airflow get that the dates you are dealing with are not
> > > what you expect, but Day 1+ there's still a lot of cognitive overhead if
> > > you don't have the exact same model as Airbnb for running DAGs and tasks.
> > >
> > > My 2 cents anyway,
> > > Damian Shaw
> > >
> > >
> > > -----Original Message-----
> > > From: Ash Berlin-Taylor [mailto:ash@apache.org]
> > > Sent: Tuesday, April 09, 2019 10:08 AM
> > > To: dev@airflow.apache.org
> > > Subject: [DISCUSS] period_start/period_end instead of execution_date/next_execution_date
> > >
> > > (trying to break this out in to another thread)
> > >
> > > The ML doesn't allow images, but I can guess that it is the deps
> > > section of a task instance details screen?
> > >
> > > I'm not saying it's not clear once you know to look there, but I'm
> > > trying to remove/reduce the confusion in the first place. And I think we
> > > as committers aren't best placed to know what makes sense as we have
> > > internalised how Airflow works :)
> > >
> > > So I guess this is a question to the newest people on the list: Would
> > > `period_start` and `period_end` be more or less confusing for you when
> > > you were first getting started with Airflow?
> > >
> > > -ash
> > >
> > >> On 9 Apr 2019, at 14:47, Driesprong, Fokko <fokko@driesprong.frl> wrote:
> > >>
> > >> Ash,
> > >>
> > >> Personally, I think this is quite clear, there is a list of reasons
> > >> why the job isn't being scheduled:
> > >>
> > >>
> > >> Coming back to Bas's question, I believe that yesterday_ds does not
> > >> make sense since we cannot assume that the schedule is daily. I don't
> > >> see any usage of this variable. Personally, I do use
> > >> next_execution_date quite extensively. Say you have a job that runs
> > >> daily, but you want to change it to an hourly job. In such a case you
> > >> don't want to change {{ (execution_date + macros.timedelta(days=1)) }}
> > >> to {{ (execution_date + macros.timedelta(hours=1)) }} everywhere.
> > >>
> > >> I'm just not sure if the aggressive deprecation of these is really
> > >> worth it. I don't see too much harm in letting them stay.
> > >>
> > >> Cheers, Fokko
> > >>
> > >> On Tue, 9 Apr 2019 at 12:17, Ash Berlin-Taylor <ash@apache.org> wrote:
> > >> To (slightly) hijack this thread:
> > >>
> > >> On the subject of execution_date: as I'm sure we're all aware, the
> > >> concept of execution_date is confusing to newcomers to Airflow (there
> > >> are many questions about "why hasn't my DAG run yet?", "Why is my DAG a
> > >> day behind?" etc.) and although we mention this in the docs it's a
> > >> confusing concept.
> > >>
> > >> What do people think about adding two new parameters: `period_start`
> > >> and `period_end` and making these the preferred terms in place of
> > >> execution_date and next_execution_date?
> > >>
> > >> This hopefully avoids any ambiguous terms like "execution" or "run"
> > >> which are understandably easy to conflate with the time the task is
> > >> being run (i.e. `now()`).
> > >>
> > >> If people think this naming is better and less confusing I would
> > >> suggest we update all the docs and examples to use these terms (but
> > >> still mention the old names somewhere, probably in the macros docs).
> > >>
> > >> Thoughts?
> > >>
> > >> -ash
> > >>
> > >>
> > >>> On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wiedmer@gmail.com> wrote:
> > >>>
> > >>> Hi Bas,
> > >>>
> > >>> 1) I am aware of a few places where those parameters are used in
> > >>> production in a few hundred jobs. I highly recommend we don't
> > >>> deprecate them unless we do it in a major version.
> > >>>
> > >>> 2) As James mentioned, inlets and outlets are a lineage annotation
> > >>> feature which is still under development. Let's leave them in, but we
> > >>> can guard them behind a feature flag if you prefer.
> > >>>
> > >>> 3) The yesterday*/tomorrow* params are convenience ones if you use a
> > >>> daily ETL. I agree with you that they are simple to compute, but not
> > >>> everyone using Apache Airflow is amazing with Python. Some users are
> > >>> only trying to get a query scheduled and rely on a couple of niceties
> > >>> like these to get by.
> > >>>
> > >>> 4) latest_date, end_date (I feel like there used to be start_date, but
> > >>> maybe it got lost) were a blend of things which were used by a
> > >>> backfill framework used internally at Airbnb. Latest date was used if
> > >>> you needed to join to a dimension for which you only wanted the latest
> > >>> version of the attributes in your backfill. end_date was used for time
> > >>> ranges where several days were processed together in a range to save
> > >>> on compute. I don't see an issue with removing them.
> > >>>
> > >>> Best regards,
> > >>> Arthur
> > >>>
> > >>>
> > >>>
> > >>> On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak <basharenslak@godatadriven.com> wrote:
> > >>>
> > >>>> Hi all,
> > >>>>
> > >>>> Following Tao Feng’s question to discuss this PR
> > >>>> <https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192
> > >>>> <https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss
> > >>>> here if you agree/disagree/would change.
> > >>>>
> > >>>> -----------
> > >>>>
> > >>>> The summary of the PR:
> > >>>>
> > >>>> I was confused by the task context values and suggest cleaning up and
> > >>>> clarifying these variables. Some are derivations from other
> > >>>> variables, some are undocumented and unused, some are wrong (name
> > >>>> doesn’t match the value). Please discuss what you think of the
> > >>>> removal of these variables:
> > >>>>
> > >>>>
> > >>>> *   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> > >>>> tomorrow_ds_nodash. IMO the next_* and previous_* variables are
> > >>>> useful since these require complex logic to compute the next
> > >>>> execution date, however I would leave computing the yesterday* and
> > >>>> tomorrow* variables up to the user since they are simple one-liners
> > >>>> and don't relate to the DAG interval.
> > >>>> *   Removed tables. This is a field in params, and is thus also
> > >>>> accessible by the user ({{ params.tables }}). Also, it was
> > >>>> undocumented.
> > >>>> *   Removed latest_date. It's the same as ds and was also
> > >>>> undocumented.
> > >>>> *   Removed inlets and outlets. Also undocumented, and have the
> > >>>> inlets/outlets ever worked/ever been used by anybody?
> > >>>> *   Removed end_date and END_DATE. Both have the same value, so it
> > >>>> doesn't make sense to have both variables. Also, the value is ds,
> > >>>> which contains the start date of the interval, so the naming didn't
> > >>>> make sense to me. However, if anybody argues in favour of adding
> > >>>> "start_date" and "end_date" to provide the start and end datetime of
> > >>>> task instance intervals, I'd be happy to add them.
> > >>>>
> > >>>> Cheers,
> > >>>> Bas
> > >>>>
> > >>
