airflow-dev mailing list archives

From Daniel Standish <dpstand...@gmail.com>
Subject Re: Setting to add choice of schedule at end or schedule at start of interval
Date Wed, 04 Sep 2019 03:14:03 GMT
What if we merely add a property "run_date" to DagRun?  At present
this would be essentially the same as "next_execution_date".

Then no change to scheduler would be required, and no new dag parameter or
config.  Perhaps you could add a toggle to the DAGs UI view that lets you
choose whether to display "last run" by "run_date" or "execution_date".

If you want your dags to be parameterized by the date when they are meant to
be run -- as opposed to their implicit interval-of-interest -- then you can
reference "run_date".

One potential source of confusion with this is backfilling: what does
"run_date" mean in the context of a backfill?  You could say it means
essentially "initial run date", i.e. "do not run before date", i.e. "run
after date" or "run-at date".  So, for a daily job, the 2019-01-02
"run_date" corresponds to a 2019-01-01 execution_date.  This makes sense,
right?
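
To make the mapping concrete, here is a minimal sketch, assuming a
fixed-length schedule interval.  `run_date_for` is a hypothetical helper
used only for illustration; it is not an existing Airflow function, and
cron-style schedules would need their own next-fire-time logic.

```python
from datetime import datetime, timedelta


def run_date_for(execution_date: datetime, interval: timedelta) -> datetime:
    """Hypothetical helper: the earliest time the run may start.

    Under the current scheduler semantics a run is triggered once its
    interval has elapsed, so the "run-at date" is execution_date plus
    one schedule interval.
    """
    return execution_date + interval


# Daily schedule: execution_date 2019-01-01 maps to run_date 2019-01-02.
execution_date = datetime(2019, 1, 1)
run_date = run_date_for(execution_date, timedelta(days=1))
print(run_date.date())  # 2019-01-02
```

The same arithmetic applies in a backfill: the value is the "do not run
before" date for that dag run, regardless of when the backfill is started.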

Perhaps in the future, the relationship between "run_date" and
"execution_date" can be more dynamic.  Perhaps in the future we rename
"execution_date" for clarity, or to be more generic.  But it makes sense
that a dag run will always have a run date, so it doesn't seem like a
terrible idea to add a property representing this.

Would this meet the goals of the PR?




On Wed, Aug 28, 2019 at 11:50 AM James Meickle
<jmeickle@quantopian.com.invalid> wrote:

> Totally agree with Daniel here. I think that if we implement this feature
> as proposed, it will actively discourage us from implementing a better
> data-aware feature that would remain invisible to most users while neatly
> addressing a lot of edge cases that currently require really ugly hacks. I
> believe that having more data awareness features in Airflow (like the data
> lineage work, or other metadata integrations) is worth investing in if we
> can do it without too much required user-facing complexity. The Airflow
> project isn't a full data warehouse suite but it's also not just "cron with
> a UI", so we should try to be pragmatic and fit in power-user features
> where we can do so without compromising the project's overall goals.
>
> On Wed, Aug 28, 2019 at 2:24 PM Daniel Standish <dpstandish@gmail.com>
> wrote:
>
> > I am just thinking there is the potential for a more comprehensive
> > enhancement here, and I worry that this is a band-aid that, like all new
> > features, has the potential to constrain future options.  It does not help
> > us to do anything we cannot already do.
> >
> > The source of this problem is that scheduling and interval-of-interest are
> > mixed together.
> >
> > My thought is there may be a way to separate scheduling and
> > interval-of-interest to uniformly resolve "execution_date" vs "run_date"
> > confusion.  We could make *explicit* instead of *implicit* the relationship
> > between run_date *(not currently a concept in airflow)* and
> > "interval-of-interest" *(currently represented by execution_date)*.
> >
> > I also see in this the potential to unlock some other improvements:
> > * support a greater diversity of incremental processes
> > * allow more flexible backfilling
> > * provide better views of data you have vs data you don't.
> >
> > The canonical airflow job is a date-partitioned, idempotent data pull.  Your
> > interval of interest is from execution_date to execution_date + 1
> > interval.  Schedule_interval is not just the scheduling cadence but it is
> > also your interval-of-interest partition function.   If that doesn't work
> > for your job, you set catchup=False and roll your own.
> >
> > What if there was a way to generalize?  E.g. could we allow for a more
> > flexible partition function that deviates from the scheduler cadence?  E.g.
> > what if your interval-of-interest partitions could be governed by "min 1
> > day, max 30 days"?  Then, on an ongoing basis, your daily loads would be a
> > range of 1 day, but if the server is down for a couple of days, this could
> > be caught up in one task, and if you backfill, it could be in up to 30-day
> > batches.
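
A rough sketch of the "min 1 day, max 30 days" idea quoted above, under the
assumption that the partition function only needs the pending date range.
`partition_interval` and its parameters are hypothetical names for
illustration; nothing like this exists in Airflow today.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def partition_interval(
    start: datetime,
    end: datetime,
    min_size: timedelta = timedelta(days=1),
    max_size: timedelta = timedelta(days=30),
) -> List[Tuple[datetime, datetime]]:
    """Split [start, end) into chunks of at most max_size.

    A trailing remainder shorter than min_size is not emitted; it is
    left for a later run, once enough time has accumulated.
    """
    chunks = []
    cursor = start
    while end - cursor >= min_size:
        chunk_end = min(cursor + max_size, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks


start = datetime(2019, 1, 1)
# Normal daily load: one 1-day chunk.
print(len(partition_interval(start, start + timedelta(days=1))))   # 1
# Server down for three days: caught up in a single 3-day chunk.
print(len(partition_interval(start, start + timedelta(days=3))))   # 1
# 65-day backfill: chunks of 30, 30 and 5 days.
print(len(partition_interval(start, start + timedelta(days=65))))  # 3
```
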
> >
> > Perhaps there is an abstraction that could be used by a greater diversity
> > of incremental processes.  Such a thing could support a nice "data
> > contiguity view". I imagine a horizontal bar that is solid where we have
> > the data and empty where we don't.  Then you click on a "missing" section
> > and you can  trigger a backfill task with that date interval according to
> > your partitioning rules.
> >
> > I can imagine using this for an incremental job where each time we pull the
> > new data since last time; in the `execute` method the operator could set
> > `self.high_watermark` with the max datetime processed.  Or maybe a callback
> > function could be used to gather this value.  This value could be used in
> > the next run, and could be depicted in a view.
> >
> > Default intervals of interest could be status quo -- i.e. partitions equal
> > to schedule interval -- but could be overwritten using templating or
> > callbacks or setting it during `execute`.
> >
> > So anyway, I don't have a master plan all figured out.  But I think there
> > is opportunity in this area for more comprehensive enhancement that goes
> > more directly at the root of the problem.
> >
> >
> >
> >
> > On Tue, Aug 27, 2019 at 10:00 AM Maxime Beauchemin <
> > maximebeauchemin@gmail.com> wrote:
> >
> > > How about an alternative approach that would introduce 2 new keyword
> > > arguments that are clear (something like, but maybe better than
> > > `period_start_dttm`, `period_end_dttm`) and leave `execution_date`
> > > unchanged, but plan its deprecation. As a first step `execution_date`
> > > would be inferred from the new args, and warn about deprecation when
> > > used.
> > >
> > > Max
> > >
> > > On Tue, Aug 27, 2019 at 9:26 AM Bolke de Bruin <bdbruin@gmail.com>
> > > wrote:
> > >
> > > > Execution date is execution date for a dag run no matter what. There is
> > > > no end interval or start interval for a dag run. The only time this is
> > > > relevant is when we calculate the next or previous dagrun.
> > > >
> > > > So I don't think Daniel's rationale makes sense (?)
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On 27 Aug 2019, at 17:40, Philippe Gagnon <philgagnon1@gmail.com>
> > > > > wrote:
> > > > >
> > > > > I agree with Daniel's rationale but I am also worried about backwards
> > > > > compatibility as this would perhaps be the most disruptive breaking
> > > > > change possible. I think maybe we should write down the different options
> > > > > available to us (AIP?) and call for a vote. What does everyone think?
> > > > >
> > > > >> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jcoder01@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> Can't execution date already mean different things depending on if the
> > > > > >> dag run was initiated via the scheduler or manually via command line/API?
> > > > > >> I agree that making it consistent might make it easier to explain to new
> > > > > >> users, but should we exchange that for breaking pretty much every existing
> > > > > >> dag by re-defining what execution date is?
> > > > >> -James
> > > > >>
> > > > > >> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <dpstandish@gmail.com>
> > > > > >> wrote:
> > > > >>
> > > > >>>>
> > > > > >>>> To Daniel’s concerns, I would argue this is not a change to what a dag
> > > > > >>>> run is, it is rather a change to WHEN that dag run will be scheduled.
> > > > >>>
> > > > >>>
> > > > > >>> Execution date is part of the definition of a dag_run; it is uniquely
> > > > > >>> identified by an execution_date and dag_id.
> > > > > >>>
> > > > > >>> When someone asks what is a dag_run, we should be able to provide an
> > > > > >>> answer.
> > > > > >>>
> > > > > >>> Imagine trying to explain what a dag run is, when execution_date can mean
> > > > > >>> different things.
> > > > > >>>    Admin: "A dag run is an execution_date and a dag_id".
> > > > > >>>    New user: "Ok. Clear as a bell. What's an execution_date?"
> > > > > >>>    Admin: "Well, it can be one of two things.  It *could* be when the dag
> > > > > >>> will be run... but it could *also* be 'the time when dag should be run
> > > > > >>> minus one schedule interval'.  It depends on whether you choose 'end' or
> > > > > >>> 'start' for 'schedule_interval_edge.'  If you choose 'start' then
> > > > > >>> execution_date means 'when dag will be run'.  If you choose 'end' then
> > > > > >>> execution_date means 'when dag will be run minus one interval.'  If you
> > > > > >>> change the parameter after some time, then we don't necessarily know what
> > > > > >>> it means at all times".
> > > > >>>
> > > > >>> Why would we do this to ourselves?
> > > > >>>
> > > > >>> Alternatively, we can give dag_run a clear, unambiguous meaning:
> > > > >>> * dag_run is dag_id + execution_date
> > > > > >>> * execution_date is when dag will be run (notwithstanding scheduler
> > > > > >>>   delay, queuing)
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > > >>> Execution_date is defined as "run-at date minus 1 interval".  The
> > > > > >>> assumption in this is that your tasks care about this particular date.
> > > > > >>> Obviously this makes sense for some tasks but not for others.
> > > > >>>
> > > > >>> I would prop
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > > >>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcoder01@gmail.com>
> > > > > >>>> wrote:
> > > > >>>>
> > > > > >>>> I think this is a great improvement and should be merged. To Daniel’s
> > > > > >>>> concerns, I would argue this is not a change to what a dag run is, it is
> > > > > >>>> rather a change to WHEN that dag run will be scheduled.
> > > > > >>>> I had implemented a similar change in my own version but ultimately backed
> > > > > >>>> it out so I didn’t have to patch after each new release. In my opinion the
> > > > > >>>> main flaw in the current scheduler, and I have brought this up before, is
> > > > > >>>> when you don’t have a consistent schedule interval (e.g. only run M-F).
> > > > > >>>> After backing out the “schedule at interval start” I had to switch to a
> > > > > >>>> daily schedule and go through and put a short circuit operator in each of
> > > > > >>>> my M-F dags to get the behavior that I wanted. This results in putting
> > > > > >>>> scheduling logic inside the dag, when scheduling logic should be in the
> > > > > >>>> scheduler.
> > > > >>>>
> > > > >>>> -James
> > > > >>>>
> > > > >>>>
> > > > > >>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <dpstandish@gmail.com>
> > > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>> Re
> > > > >>>>>
> > > > > >>>>>> What are people's feelings on changing the default execution to schedule
> > > > > >>>>>> interval start
> > > > > >>>>>
> > > > > >>>>> and
> > > > > >>>>>
> > > > > >>>>>> I'm in favor of doing that, but then exposing new variables of
> > > > > >>>>>> "interval_start" and "interval_end", etc. so that people write
> > > > > >>>>>> clearer-looking at-a-glance DAGs
> > > > >>>>>
> > > > >>>>>
> > > > > >>>>> While I am def on board with the spirit of this PR, I would vote we do not
> > > > > >>>>> accept this PR as is, because it cements a confusing option.
> > > > >>>>>
> > > > >>>>> *What is the right representation of a dag run?*
> > > > >>>>>
> > > > > >>>>> Right now the representation is "dag run-at date minus 1 interval".  It
> > > > > >>>>> should just be "dag run-at date".
> > > > >>>>>
> > > > > >>>>> We don't need to address the question of whether execution date is the
> > > > > >>>>> start or the end of an interval; it doesn't matter.
> > > > >>>>>
> > > > > >>>>> In all cases, a given dag run will be targeted for *some* initial "run-at
> > > > > >>>>> time"; so *that* should be the time that is part of the PK of a dag run,
> > > > > >>>>> and *that* is the time that should be exposed as the dag run "execution
> > > > > >>>>> date".
> > > > >>>>>
> > > > >>>>> *Interval of interest is not a dag_run attribute*
> > > > >>>>>
> > > > > >>>>> We also mix in this question of the date interval that the *tasks* are
> > > > > >>>>> interested in.  But the *dag run* need not concern itself with this in any
> > > > > >>>>> way.  That is for the tasks to figure out: if they happen to need "dag
> > > > > >>>>> run-at date," then they can reference that; if they want the prior one,
> > > > > >>>>> ask for the prior one.
> > > > >>>>>
> > > > > >>>>> Previously, I was in the camp that thought it was a great idea to rename
> > > > > >>>>> "execution_date" to "period_start" or "interval_start".  But I now think
> > > > > >>>>> this is folly.  It invokes this question of the "interval of interest" or
> > > > > >>>>> "period of interest".  But the dag doesn't need to know anything about
> > > > > >>>>> that.
> > > > >>>>>
> > > > > >>>>> Within the same dag you may have tasks with different intervals of
> > > > > >>>>> interest.  So why make assumptions in the dag; just give the facts: this is
> > > > > >>>>> my run date; this is the prior run date, etc.  It would be a regression
> > > > > >>>>> from the perspective of providing accurate names.
> > > > >>>>>
> > > > >>>>> *Proposal*
> > > > >>>>>
> > > > > >>>>> So, I would propose we change "execution_date" to mean "dag run-at date" as
> > > > > >>>>> opposed to "dag run-at date minus 1".  But we should do so without
> > > > > >>>>> reference to interval end or interval start.
> > > > >>>>>
> > > > >>>>> *Configurability*
> > > > >>>>>
> > > > > >>>>> The more configuration options we have, the more noise there is as a user
> > > > > >>>>> trying to understand how to use airflow, so I'd rather us not make this
> > > > > >>>>> configurable at all.
> > > > >>>>>
> > > > > >>>>> That said, perhaps a more clear and more explicit means of making this
> > > > > >>>>> configurable would be to define an integer param
> > > > > >>>>> "dag_run_execution_date_interval_offset", which would control how many
> > > > > >>>>> intervals back from actual "dag run-at date" the "execution date" should
> > > > > >>>>> be.  (current behavior = 1, new behavior = 0).
> > > > >>>>>
> > > > >>>>> *Side note*
> > > > >>>>>
> > > > > >>>>> Hopefully not to derail discussion: I think there are additional, related
> > > > > >>>>> task attributes that may want to come into being: namely, low_watermark and
> > > > > >>>>> high_watermark.  There is the potential, with attributes like this, for
> > > > > >>>>> adding better out-of-the-box support for common data workflows that we now
> > > > > >>>>> need to use xcom for, namely incremental loads.  But I want to give it more
> > > > > >>>>> thought before proposing anything specific.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > > >>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> > > > > >>>>> wrote:
> > > > >>>>>
> > > > > >>>>>> Good one Damian. I will have a list of issues that can be possible to
> > > > > >>>>>> handle at the workshop, so that one goes there.
> > > > >>>>>>
> > > > >>>>>> J.
> > > > >>>>>>
> > > > >>>>>> Principal Software Engineer
> > > > >>>>>> Phone: +48660796129
> > > > >>>>>>
> > > > > >>>>>> On Fri, 23 Aug 2019 at 11:09, Shaw, Damian P.
> > > > > >>>>>> <damian.shaw.2@credit-suisse.com> wrote:
> > > > >>>>>>
> > > > > >>>>>>> I can't overstate what a conceptual improvement this would be for the end
> > > > > >>>>>>> users of Airflow in our environment. I've written a lot of code so all our
> > > > > >>>>>>> configuration works like this anyway. But the UI still shows the Airflow
> > > > > >>>>>>> dates which still to this day sometimes confuse me.
> > > > >>>>>>>
> > > > > >>>>>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some of my first
> > > > > >>>>>>> PRs could be additional test cases around edge cases to do with DST and
> > > > > >>>>>>> cron scheduling that I have concerns about :)
> > > > >>>>>>>
> > > > >>>>>>> Damian
> > > > >>>>>>>
> > > > >>>>>>> -----Original Message-----
> > > > >>>>>>> From: Ash Berlin-Taylor [mailto:ash@apache.org]
> > > > >>>>>>> Sent: Friday, August 23, 2019 6:50 AM
> > > > >>>>>>> To: dev@airflow.apache.org
> > > > > >>>>>>> Subject: Setting to add choice of schedule at end or schedule at start of
> > > > > >>>>>>> interval
> > > > >>>>>>>
> > > > > >>>>>>> This has come up a few times before, someone has now opened a PR that
> > > > > >>>>>>> makes this a global+per-dag setting:
> > > > > >>>>>>> https://github.com/apache/airflow/pull/5787 and it also includes docs
> > > > > >>>>>>> that I think do a good job of illustrating the two modes.
> > > > >>>>>>>
> > > > > >>>>>>> Does anyone object to this being merged? If no one says anything by midday
> > > > > >>>>>>> on Tuesday I will take that as assent and will merge it.
> > > > >>>>>>>
> > > > >>>>>>> The docs from the PR included below.
> > > > >>>>>>>
> > > > >>>>>>> Thanks,
> > > > >>>>>>> Ash
> > > > >>>>>>>
> > > > >>>>>>> Scheduled Time vs Execution Time
> > > > >>>>>>> ''''''''''''''''''''''''''''''''
> > > > >>>>>>>
> > > > > >>>>>>> A DAG with a ``schedule_interval`` will execute once per interval. By
> > > > > >>>>>>> default, the execution of a DAG will occur at the **end** of the
> > > > > >>>>>>> schedule interval.
> > > > >>>>>>>
> > > > >>>>>>> A few examples:
> > > > >>>>>>>
> > > > > >>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that processes
> > > > > >>>>>>>   2019-08-16 17:00 will start running just after 2019-08-16 17:59:59,
> > > > > >>>>>>>   i.e. once that hour is over.
> > > > > >>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run that processes
> > > > > >>>>>>>   2019-08-16 will start running shortly after 2019-08-17 00:00.
> > > > >>>>>>>
> > > > > >>>>>>> The reasoning behind this execution vs scheduling behaviour is that
> > > > > >>>>>>> data for the interval to be processed won't be fully available until
> > > > > >>>>>>> the interval has elapsed.
> > > > >>>>>>>
> > > > > >>>>>>> In cases where you wish the DAG to be executed at the **start** of the
> > > > > >>>>>>> interval, specify ``schedule_at_interval_end=False``, either in
> > > > > >>>>>>> ``airflow.cfg``, or on a per-DAG basis.
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> ===============================================================================
> > > > >>>>>>>
> > > > >>>>>>> Please access the attached hyperlink for
an important
> > electronic
> > > > >>>>>>> communications disclaimer:
> > > > >>>>>>>
> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> ===============================================================================
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
>
