airflow-dev mailing list archives

From airflowuser <airflowu...@protonmail.com.INVALID>
Subject Re: [DISCUSS] period_start/period_end instead of execution_date/next_execution_date
Date Mon, 15 Apr 2019 12:52:08 GMT
To quote my user-experience professor from ages ago:
"If too many people misuse something you wrote it means that YOU are doing something wrong".

Something can be well documented but if it's not intuitive it's likely that people will get
it wrong.

Say someone asks, "When did you execute the code?" Your answer will be direct: the time the
code started to run. This is why so many people misunderstand execution_date in
Airflow. Airflow took a word that is well defined in our minds and replaced its meaning.
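
As a rough illustration of that mismatch (a sketch in plain Python, not Airflow code; `describe_run` is a hypothetical helper, and it assumes a fixed-length schedule such as daily), the run labelled with a given execution_date only starts once the interval it names has ended:

```python
from datetime import datetime, timedelta, timezone


def describe_run(execution_date, schedule_interval):
    """Hypothetical helper: map an execution_date to the interval it labels.

    Assumes a fixed-length schedule (e.g. daily), as discussed in this thread.
    """
    interval_start = execution_date
    interval_end = execution_date + schedule_interval
    return {
        "execution_date": execution_date,
        "interval_start": interval_start,
        "interval_end": interval_end,
        # The run is only triggered once the interval has closed.
        "earliest_actual_start": interval_end,
    }


run = describe_run(datetime(2019, 4, 14, tzinfo=timezone.utc), timedelta(days=1))
for name, value in run.items():
    print(f"{name:>22}: {value.isoformat()}")
```

For a daily schedule, the run with execution_date 2019-04-14 therefore starts no earlier than 2019-04-15, a full day after the date in its name.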


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, April 15, 2019 3:35 PM, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:

> I think if the mission of Airflow is to be a generic Workflow engine, the
> current semantics of execution date aren't a good default. This might be an
> unpopular opinion given past threads on this topic :).
>
> The execution_date = end_date semantics make sense for the ETL use case but
> not for other use cases. I think Cron syntax is more intuitive to users,
> i.e. start_date should match execution_date (although I don't have data to
> back this up). This is especially prevalent in ML, where it's almost a rite of
> passage for users to get confused by execution date semantics. I think a
> flag to support different execution date semantics makes sense, even at the
> cost of the headache of supporting both, and even though the added
> complexity could lead to bugs and trickier mailing list support.
>
> On Wed, Apr 10, 2019 at 9:00 PM Gabriel Silk gsilk@dropbox.com.invalid
> wrote:
>
> > My two cents:
> > "execution_date" is definitely confusing to newcomers, and it's partly the
> > ambiguity of the wording, and partly the UI's fault. When I first saw
> > execution date, I assumed it meant the earliest time at which the task
> > will execute, which is wrong. I was confused when no tasks appeared for 3pm
> > until 4pm.
> > My proposal to fix that:
> >
> > 1.  Always show the next task to be executed in the UI, but explain to the
> >     user that it's not running because its interval has not yet completed.
> >     Indicate this state visually, perhaps by using some transparency or another
> >     color.
> >
> > 2.  Instead of just showing execution date in the UI, show the low/high
> >     range of the time period it covers (for periodic jobs).
> >
> >
> > As for what we call the low/high timestamps, I like these two options:
> >
> > -   low_ts, high_ts
> > -   interval_start, interval_end
> >
> > On Wed, Apr 10, 2019 at 6:43 AM James Meickle
> > jmeickle@quantopian.com.invalid wrote:
> >
> > > Strictly tying execution start to interval end doesn't work for some
> > > workflows (my guess, 1-5% of them?):
> > >
> > > -   You need to start performing tasks before the interval is over
> > > -   You have tasks that reference a single interval, but can't be completed
> > >     until several intervals later (due to data latency)
> > >
> > > -   The frequency you need to run the task on is different than the
> > >     frequency of the interval you need to process (like processing all
> > >     records from the last five days, every day)
> > >
> > > Airflow doesn't handle any of these situations gracefully and I've seen
> > > people attempt all sorts of workarounds for them. Probably even more
> > > people would try, if we provided decent idioms for doing it rather than
> > > those workarounds.
> > >
> > > On Wed, Apr 10, 2019 at 9:30 AM Driesprong, Fokko fokko@driesprong.frl
> > > wrote:
> > >
> > > > I see what you mean. I don't really like the `period_{start,end}` name,
> > > > but something such as `interval_{start,end}` might do it for me.
> > > >
> > > > Personally, I think running the job after the interval closes (since
> > > > then you have all the data over the interval) makes complete sense for
> > > > ETL jobs. I agree it requires some time to get used to. Maybe we're
> > > > lacking on documentation here.
> > > >
> > > > Cheers, Fokko
> > > >
> > > > On Wed, 10 Apr 2019 at 10:08, Flo Rance trourance@gmail.com wrote:
> > > >
> > > > > I didn't expect to participate in any debate on this software, as
> > > > > I'm a complete newcomer. But I'm almost forced to, as I am the target
> > > > > audience, too.
> > > > >
> > > > > To answer your initial question, after reading a lot of
> > > > > documentation I find the term execution_date really counterintuitive,
> > > > > so yes, maybe period_start and period_end might be better naming to
> > > > > help understand how all the initial scheduling works. Because even
> > > > > after reading the scheduling section of the doc and the FAQ, it was
> > > > > still not clear in my mind. Btw, I find some ideas exposed by James
> > > > > Meickle in the [DISCUSS] AIRFLOW-4192 thread very interesting and I
> > > > > share his opinion that there's still room for improvement.
> > > > >
> > > > > But a mode to change from "run at end of period, I need all the data
> > > > > available for this period" (the current) to "run at this time on the
> > > > > schedule_interval" would be awesome.
> > > > >
> > > > > Regards,
> > > > > Flo
> > > > >
> > > > > On Tue, Apr 9, 2019 at 4:41 PM Ash Berlin-Taylor ash@apache.org
> > > > > wrote:
> > > >
> > > > > > Yeah, that's the other thing that has been talked about from
> > > > > > time-to-time, which is a mode to change from "run at end of period,
> > > > > > I need all the data available for this period" (the current) to
> > > > > > "run at this time on the schedule_interval, don't wait for the
> > > > > > period to end".
> > > > > >
> > > > > > (No such flag exists right now, before you go looking.)
> > > > > >
> > > > > > > On 9 Apr 2019, at 15:31, Shaw, Damian P.
> > > > > > > <damian.shaw.2@credit-suisse.com> wrote:
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I'm new to this Airflow Dev mailing list so I wasn't expecting to
> > > > > > > reply to anything, but I feel I am the target audience for this
> > > > > > > question. I am quite new to Airflow and have been setting up an
> > > > > > > Airflow environment for my business this last month.
> > > > > > >
> > > > > > > I find the current "execution_date" a small technical burden and
> > > > > > > a large cognitive burden. Our workflow is based on DAGs running at
> > > > > > > a specified time in a specified timezone using the same date as
> > > > > > > the current calendar date.
> > > > > > >
> > > > > > > I have worked around this by creating my own macro and context
> > > > > > > variables, with the logic looking like this:
> > > > > > >
> > > > > > >     airflow_execution_date = context['execution_date']
> > > > > > >     dag_timezone = context['dag'].timezone
> > > > > > >     local_execution_date = dag_timezone.convert(airflow_execution_date)
> > > > > > >     local_cal_date = local_execution_date + datetime.timedelta(days=1)
> > > > > > >
> > > > > > > As you can see this isn't a lot of technical effort, but having a
> > > > > > > date that 1) is in the timezone the business users are working in,
> > > > > > > and 2) is the same calendar date the business users are working
> > > > > > > in significantly reduces the cognitive effort required to set up
> > > > > > > tasks. Of course this doesn't help with cron format scheduling,
> > > > > > > for which I just let the business give me the requirements and
> > > > > > > set it up myself, as the date logic there is still confusing: it
> > > > > > > doesn't work like real cron scheduling, which everyone is
> > > > > > > familiar with.
> > > > > > >
> > > > > > > Maybe "period_start" and "period_end" might help people on Day 0
> > > > > > > of understanding Airflow get that the dates you are dealing with
> > > > > > > are not what you expect, but Day 1+ there's still a lot of
> > > > > > > cognitive overhead if you don't have the exact same model as
> > > > > > > Airbnb for running DAGs and tasks.
> > > > > > >
> > > > > > > My 2 cents anyway,
> > > > > > > Damian Shaw
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Ash Berlin-Taylor [mailto:ash@apache.org]
> > > > > > > Sent: Tuesday, April 09, 2019 10:08 AM
> > > > > > > To: dev@airflow.apache.org
> > > > > > > Subject: [DISCUSS] period_start/period_end instead of
> > > > > > > execution_date/next_execution_date
> > > > > > >
> > > > > > > (trying to break this out in to another thread)
> > > > > > >
> > > > > > > The ML doesn't allow images, but I can guess that it is the deps
> > > > > > > section of a task instance details screen?
> > > > > > >
> > > > > > > I'm not saying it's not clear once you know to look there, but
> > > > > > > I'm trying to remove/reduce the confusion in the first place. And
> > > > > > > I think we as committers aren't best placed to know what makes
> > > > > > > sense as we have internalised how Airflow works :)
> > > > > > >
> > > > > > > So I guess this is a question to the newest people on the list:
> > > > > > > Would `period_start` and `period_end` be more or less confusing
> > > > > > > for you when you were first getting started with Airflow?
> > > > > > >
> > > > > > > -ash
> > > > > > >
> > > > > > > > On 9 Apr 2019, at 14:47, Driesprong, Fokko <fokko@driesprong.frl>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > Ash,
> > > > > > > >
> > > > > > > > Personally, I think this is quite clear, there is a list of
> > > > > > > > reasons why the job isn't being scheduled:
> > > > > > > >
> > > > > > > > Coming back to the question of Bas, I believe that yesterday_ds
> > > > > > > > does not make sense since we cannot assume that the schedule is
> > > > > > > > daily. I don't see any usage of this variable. Personally, I do
> > > > > > > > use next_execution_date quite extensively. Say you have a job
> > > > > > > > that runs daily, but you want to change this to an hourly job.
> > > > > > > > In such a case you don't want to change
> > > > > > > > {{ (execution_date + macros.timedelta(days=1)) }} to
> > > > > > > > {{ (execution_date - macros.timedelta(hours=1)) }} everywhere.
> > > > > > > >
> > > > > > > > I'm just not sure if the aggressive deprecation is really worth
> > > > > > > > it. I don't see too much harm in letting them stay.
> > > > > > > >
> > > > > > > > Cheers, Fokko
> > > > > > > >
> > > > > > > > On Tue, 9 Apr 2019 at 12:17, Ash Berlin-Taylor <ash@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > To (slightly) hijack this thread:
> > > > > > > >
> > > > > > > > On the subject of execution_date: as I'm sure we're all aware,
> > > > > > > > the concept of execution_date is confusing to newcomers to
> > > > > > > > Airflow (there are many questions about "why hasn't my DAG run
> > > > > > > > yet?", "why is my DAG a day behind?" etc.) and although we
> > > > > > > > mention this in the docs it's a confusing concept.
> > > > > > > >
> > > > > > > > What do people think about adding two new parameters:
> > > > > > > > `period_start` and `period_end` and making these the preferred
> > > > > > > > terms in place of execution_date and next_execution_date?
> > > > > > > >
> > > > > > > > This hopefully avoids any ambiguous terms like "execution" or
> > > > > > > > "run", which are understandably easy to conflate with the time
> > > > > > > > the task is being run (i.e. `now()`).
> > > > > > > >
> > > > > > > > If people think this naming is better and less confusing I would
> > > > > > > > suggest we update all the docs and examples to use these terms
> > > > > > > > (but still mention the old names somewhere, probably in the
> > > > > > > > macros docs).
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > > > -ash
> > > > > > > >
> > > > > > > > > On 8 Apr 2019, at 16:20, Arthur Wiedmer
> > > > > > > > > <arthur.wiedmer@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Bas,
> > > > > > > > >
> > > > > > > > > 1.  I am aware of a few places where those parameters are used
> > > > > > > > >     in production in a few hundred jobs. I highly recommend we
> > > > > > > > >     don't deprecate them unless we do it in a major version.
> > > > > > > > >
> > > > > > > > > 2.  As James mentioned, inlets and outlets are a lineage
> > > > > > > > >     annotation feature which is still under development. Let's
> > > > > > > > >     leave them in, but we can guard them behind a feature flag
> > > > > > > > >     if you prefer.
> > > > > > > > >
> > > > > > > > > 3.  The yesterday*/tomorrow* params are convenience ones if you
> > > > > > > > >     use a daily ETL. I agree with you that they are simple to
> > > > > > > > >     compute, but not everyone using Apache Airflow is amazing
> > > > > > > > >     with Python. Some users are only trying to get a query
> > > > > > > > >     scheduled and rely on a couple of niceties like these to
> > > > > > > > >     get by.
> > > > > > > > >
> > > > > > > > > 4.  latest_date, end_date (I feel like there used to be
> > > > > > > > >     start_date, but maybe it got lost) were a blend of things
> > > > > > > > >     which were used by a backfill framework used internally at
> > > > > > > > >     Airbnb. Latest date was used if you needed to join to a
> > > > > > > > >     dimension for which you only wanted the latest version of
> > > > > > > > >     the attributes in your backfill. end_date was used for time
> > > > > > > > >     ranges where several days were processed together in a
> > > > > > > > >     range to save on compute. I don't see an issue with
> > > > > > > > >     removing them.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Arthur
> > > > > > > > >
> > > > > > > > > On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak
> > > > > > > > > <basharenslak@godatadriven.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Following Tao Feng’s question to discuss this PR
> > > > > > > > > > <https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192
> > > > > > > > > > <https://issues.apache.org/jira/browse/AIRFLOW-4192>), please
> > > > > > > > > > discuss here if you agree/disagree/would change.
> > > > > > > > > >
> > > > > > > > > > The summary of the PR:
> > > > > > > > > >
> > > > > > > > > > I was confused by the task context values and suggest to clean
> > > > > > > > > > up and clarify these variables. Some are derivations from other
> > > > > > > > > > variables, some are undocumented and unused, some are wrong
> > > > > > > > > > (name doesn’t match the value).
> > > > > > > > > >
> > > > > > > > > > Please discuss what you think of the removal of these
> > > > > > > > > > variables:
> > > > > > > > > >
> > > > > > > > > > -   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> > > > > > > > > >     tomorrow_ds_nodash. IMO the next_* and previous_* variables
> > > > > > > > > >     are useful since these require complex logic to compute the
> > > > > > > > > >     next execution date, however I would leave computing the
> > > > > > > > > >     yesterday* and tomorrow* variables up to the user since
> > > > > > > > > >     they are simple one-liners and don't relate to the DAG
> > > > > > > > > >     interval.
> > > > > > > > > >
> > > > > > > > > > -   Removed tables. This is a field in params, and is thus
> > > > > > > > > >     also accessible by the user ({{ params.tables }}). Also, it
> > > > > > > > > >     was undocumented.
> > > > > > > > > >
> > > > > > > > > > -   Removed latest_date. It's the same as ds and was also
> > > > > > > > > >     undocumented.
> > > > > > > > > >
> > > > > > > > > > -   Removed inlets and outlets. Also undocumented, and have
> > > > > > > > > >     the inlets/outlets ever worked/ever been used by anybody?
> > > > > > > > > >
> > > > > > > > > > -   Removed end_date and END_DATE. Both have the same value,
> > > > > > > > > >     so it doesn't make sense to have both variables. Also, the
> > > > > > > > > >     value is ds which contains the start date of the interval,
> > > > > > > > > >     so the naming didn't make sense to me. However, if anybody
> > > > > > > > > >     argues in favour of adding "start_date" and "end_date" to
> > > > > > > > > >     provide the start and end datetime of task instance
> > > > > > > > > >     intervals, I'd be happy to add them.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Bas