airflow-dev mailing list archives

From Chris Palmer <ch...@crpalmer.com>
Subject Re: execution_date - can we stop the confusion?
Date Thu, 27 Sep 2018 17:56:23 GMT
While taking a step back makes some sense, we also need to identify what
the issue is. Simply saying 'execution_date behavior is confusing to new
users' isn't good enough. What is confusing about it? Is it what it
represents, or just the name itself?

There are a number of different timestamps that might be of interest,
including (but not limited to):

*Identifying timestamp*
For any time interval, there are two natural choices of timestamps to
represent that interval, the left and right bounds. For Airflow the left
bound has been chosen, and is called execution_date. For various reasons, I
think that makes a much better choice than the right bound.
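To make the left/right bound distinction concrete, here is a minimal sketch. It assumes a fixed daily interval expressed as a timedelta (real schedules may be cron expressions, which this ignores), and the helper name interval_bounds is made up for illustration; it is not an Airflow API.

```python
from datetime import datetime, timedelta


def interval_bounds(execution_date, schedule_interval):
    """Return the (left, right) bounds of the interval that execution_date identifies.

    Hypothetical helper for illustration only, not part of Airflow.
    """
    return execution_date, execution_date + schedule_interval


# The run that covers 26 Sep 2018 is identified by its left bound.
execution_date = datetime(2018, 9, 26)
left, right = interval_bounds(execution_date, timedelta(days=1))

# The work for this interval can only start once the interval has closed,
# i.e. at or after the right bound.
earliest_start = right

# A daily partition key derived from the left bound lines up with the data
# the run processes.
partition_key = left.strftime("%Y-%m-%d")

print(left, right, partition_key)
```

The point of the sketch is that the identifying timestamp and the earliest possible start time differ by exactly one interval, which is the part new users tend to trip over.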

*Create/update/delete timestamps*
Timestamps representing when particular database records were created,
updated, and/or deleted. I don't believe that Airflow currently records
these.

*Runtime timestamps*
The timestamps that a task or other process started and stopped. Airflow
records these for Tasks, but I think the implementation is maybe a little
lacking for DagRuns.


So what's the confusion with execution_date? Is it what it represents or
the name itself?

I think part of the learning curve with Airflow is understanding that
execution_date is the left bound of the interval. No matter what name you
use for the identifying timestamp I think new users will need to learn what
that choice means. Changing the name won't magically make all the confusion
go away.

While I don't think execution_date is the greatest name in the world, it's
a lot better than the suggested alternative run_stamped. Tasks also have an
identifying timestamp, and if I saw run_stamped on a Task I would have no
idea what it means (stamped by what?).

While there may be better names than execution_date, I don't think they are
so much better that it is worth the effort to overhaul such an integral
part of Airflow. Maybe some improvements to the documentation could be
made, but nothing so drastic as renaming such a core item.


As for the second suggestion to add "a new variable which indicated the
actual datetime when the DAG run was generated. call it
execution_start_date", it is very unclear what the desired outcome is.

To me "generated" implies creation time, i.e. when the record was written to
the database. However, creation of a DagRun record in the database is a
distinct event from when Tasks associated with that DagRun start executing.
Plus DagRuns themselves don't actually "run" - Tasks are the only things that
really get run by Airflow.

What is actually desired here?
 - The right bound of the schedule interval?
 - The time the DagRun was created?
 - The time that any Tasks associated with a DagRun were first considered
by the scheduler?
 - The time that any Tasks associated with a DagRun were first scheduled?
 - The time that any Tasks associated with a DagRun were actually started
by a worker?
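The candidates above can be laid out side by side in a small sketch. Every field name here is hypothetical and chosen only to mirror the list; none of them is claimed to exist on Airflow's DagRun or TaskInstance models, and the sample values are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CandidateTimestamps:
    """Hypothetical container for the distinct timestamps listed above."""

    execution_date: datetime       # left bound of the schedule interval
    schedule_interval: timedelta   # assumed fixed-length interval
    dagrun_created_at: datetime    # DagRun record written to the database
    first_considered_at: datetime  # a Task first considered by the scheduler
    first_scheduled_at: datetime   # a Task first scheduled
    first_started_at: datetime     # a Task actually started by a worker

    @property
    def interval_end(self):
        """Right bound of the schedule interval."""
        return self.execution_date + self.schedule_interval


# Invented values for one daily run, purely to show that these are
# different moments in time.
ts = CandidateTimestamps(
    execution_date=datetime(2018, 9, 26),
    schedule_interval=timedelta(days=1),
    dagrun_created_at=datetime(2018, 9, 27, 0, 0, 5),
    first_considered_at=datetime(2018, 9, 27, 0, 0, 7),
    first_scheduled_at=datetime(2018, 9, 27, 0, 0, 9),
    first_started_at=datetime(2018, 9, 27, 0, 0, 30),
)

print(ts.interval_end)
```

Any one of these could plausibly be what "execution_start_date" was meant to capture, which is why the proposal needs to say which one it means.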


The lack of clarity and completeness around these suggestions, alongside
inane declarations like "This name won't cause people to get confused" is
hardly a good way to get people to take suggestions seriously.

Chris


On Wed, Sep 26, 2018 at 7:37 PM George Leslie-Waksman <waksman@gmail.com>
wrote:

> This comes up a lot. I've seen it on this mailing list multiple times and
> it's something that I have to explicitly call out to every single person
> that I've helped train up on Airflow.
>
> If we take a moment to set aside why things are the way they are, what the
> documentation says, and how experienced users feel things should behave;
> there still remains the fact that a lot of new users get confused by how
> "execution_date" works.
>
> Whether it's a problem, whether we need to do something, and what we could
> do are all separate questions but I think it's important that we
> acknowledge and start from:
>
> A lot of new users get confused by how "execution_date" works.
>
> I recognize that some of this is a learning curve issue and some of this is
> a mindset issue but it begs the question: do enough users benefit from the
> current structure to justify the harm to new users?
>
> --George
>
> On Wed, Sep 26, 2018 at 1:40 PM Brian Greene <
> brian@heisenbergwoodworking.com> wrote:
>
> > It took a minute to grok, but in the larger context of how af works it
> > makes perfect sense the way it is.  Changing something so fundamentally
> > breaking to every dag in existence should bring a comparable benefit.
> > Beyond the avoiding teaching a concept you disagree with, what benefits
> > does the proposal bring to offset the cost of change?
> >
> > I’m gonna make a meme - “do you even airflow bro?”
> >
> > Sent from a device with less than stellar autocorrect
> >
> > > On Sep 26, 2018, at 8:33 AM, Maxime Beauchemin <
> > maximebeauchemin@gmail.com> wrote:
> > >
> > > I think if you have a functional mindset (as in "functional data
> > engineering
> > > <
> >
> https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
> > >")
> > > as opposed to a cron mindset, using the left bound of the time interval
> > > makes a lot of sense. Things like your daily table partition keys align
> > > with your Airflow execution_date.
> > >
> > > The main thing is that whatever we do we cannot break backwards
> > > compatibility. Offering both views (left bound/right bound), as it's
> been
> > > proposed before, either as an environment setting or a user personal
> > > preference is even more confusing to me personally. Users would have to
> > > switch context as they help each other or change environments.
> > >
> > > Also note that your intuition may differ from other people's intuition,
> > and
> > > that "unlearning" something is way harder than learning something.
> > >
> > > My personal take on this is to make this a rite of passage. This is
> just
> > > one of the many thing you have to learn when learning Airflow.
> > >
> > > Max
> > >
> > >> On Wed, Sep 26, 2018 at 8:18 AM Sam Elamin <hussam.elamin@gmail.com>
> > wrote:
> > >>
> > >> Hi Bolke
> > >>
> > >> Speaking as a consultant who is constantly training other teams how to
> > use
> > >> airflow, I do frequently see this confusion.
> > >> Another one is how the batch_date is always batch_date + interval or
> as
> > the
> > >> docs make it quite clear
> > >>
> > >> "*Let’s Repeat That* The scheduler runs your job one schedule_interval
> > >> AFTER
> > >> the start date, at the END of the period."
> > >>
> > >> Renaming it would make it simpler for newbies, but essentially they
> will
> > >> need to understand how Airflow behaves, execution_date being the batch
> > >> execution date not the run_date of the DAG
> > >>
> > >> I am actually in the process of writing a blog post
> > >> <
> https://samelamin.github.io/2017/04/27/Building-A-Datapipeline-part1/>
> > >> about this which I could use peoples feedback
> > >>
> > >> If it helps, I find that explaining how backfills work and why they
> are
> > >> important will drive home what the execution_date is :)
> > >>
> > >>
> > >> Regards
> > >> Sam
> > >>
> > >>
> > >>
> > >>> On Wed, Sep 26, 2018 at 4:10 PM Bolke de Bruin <bdbruin@gmail.com>
> > wrote:
> > >>>
> > >>> I dont think this makes sense and I dont that think anyone had a real
> > >>> issue with this. Execution date has been clearly documented  and is
> > part
> > >> of
> > >>> the core principles of airflow. Renaming will create more confusion.
> > >>>
> > >>> Please note that I do think that as an anonymous user you cannot
> speak
> > >> for
> > >>> any "new airflow user". That is a contradiction to me.
> > >>>
> > >>> Thanks
> > >>> Bolke
> > >>>
> > >>> Sent from my iPhone
> > >>>
> > >>>> On 26 Sep 2018, at 07:59, airflowuser <airflowuser@protonmail.com
> > >> .INVALID>
> > >>> wrote:
> > >>>>
> > >>>> One of the most annoying, hard to understand and against all common
> > >>> sense is the execution_date behavior. I assume that any new Airflow
> > user
> > >>> has been struggling with it.
> > >>>> The amount of questions with answers referring to :
> > >>> https://airflow.apache.org/scheduler.html?scheduling-triggers  is
> > >>> uncountable.
> > >>>>
> > >>>> Most people mistakenly think that execution_date is the datetime
> which
> > >>> the DAG started to run.
> > >>>>
> > >>>> I suggest the following changes:
> > >>>> 1. Renaming the execution_date to something else like: run_stamped
> > >>> This name won't cause people to get confused.
> > >>>> 2. Adding a new variable which indicated the actual datetime when
> the
> > >>> DAG run was generated. call it execution_start_date. People seem to
> > want
> > >>> the information when the DAG actually started to be executed/run.
> > >>>>
> > >>>> This is only naming changes. No need to actual change the behavior -
> > >>> This will only make things simpler as when user encounter
> run_stamped
> > >> he
> > >>> won't be confused by the name like execution_date
> > >>>
> > >>
> >
>
