airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maxime Beauchemin <maximebeauche...@gmail.com>
Subject Re: Setting to add choice of schedule at end or schedule at start of interval
Date Fri, 06 Sep 2019 05:40:24 GMT
Just had a thought and looked a tiny bit at the source code to assess
feasibility, but it seems like you could just derive the DAG class and
override `previous_schedule` and `following_schedule` methods. The
signature of both is you get a `datetime.datetime` and have to return
another. It's pretty easy to put your arbitrarily complex logic in there.

There may be a few hiccups to sort out things like like
`airflow.utils.dates.date_range` (where duplicated time-step logic exist)
to make sure that all time-step logic aligns with these two methods I just
mentioned, but from that point it could be become the official way to
incorporate funky date-step logic.

Max

On Wed, Sep 4, 2019 at 12:54 PM Daniel Standish <dpstandish@gmail.com>
wrote:

> Re:
>
> >  For example, if I need to run a DAG every 20 minutes between 8 AM and 4
> > PM...
>
>
> This makes a lot of sense!  Thank you for providing this example.  My
> initial thought of course is "well can't you just set it to run */20
> between 7:40am and 3:40pm," but I don't think that is possible in cron.
> Which is why you have to do hacky shit as you've said and it indeed sounds
> terrible.  I never had to achieve a schedule like this, and yeah -- it
> should not be this hard.
>
> Re:
>
> > I can’t see how adding a property to Dagrun that is essentially
> > identical to next_execution_date would add any benefit.
>
> That's why i was like what the hell is the point of this thing!   I thought
> it was just purely cosmetic, so that in effect "execution_date" would
> optionally mean "run_date".
>
>
>
>
>
>
>
> On Wed, Sep 4, 2019 at 12:10 PM James Coder <jcoder01@gmail.com> wrote:
>
> > I can’t see how adding a property to Dagrun that is essentially
> > identical to next_execution_date would add any benefit. The way I see
> > it the issue at hand here is not the availability of dates. There are
> > plenty of options in the template context for dates before and after
> > execution date. My view point is the problem this is trying to solve
> > is that waiting until the right edge of an interval has passed to
> > schedule a dag run has some shortcomings. Mainly that if your
> > intervals vary in length you are forced to put scheduling logic that
> > should reside in the scheduler in your DAGs. For example, if I need to
> > run a DAG every 20 minutes between 8 AM and 4 PM, in it's current
> > form, the scheduler won't schedule that 4PM run until 8 AM the next
> > day. "Just use next_execution_date" you say, well that's all well and
> > good between 8AM and 3:40 PM, but when 4:01 PM rolls around and you
> > don't have the results because they won't be available until after 8
> > the next day, that doesn't sound so good, does it? In order to work
> > around this, you have to add additional runs and short circuit
> > operators over and over. It's a hassle.  Allowing for scheduling dags
> > at the left edge of an interval and allowing it to behave more like
> > cron, where it runs at the time specified, not schedule + interval,
> > would make things much less complicated for users like myself that
> > can't always wait until the right edge of the interval.
> >
> >
> > James Coder
> >
> > > On Sep 3, 2019, at 11:14 PM, Daniel Standish <dpstandish@gmail.com>
> > wrote:
> > >
> > > What if we merely add a property "run_date" to DagRun?  At present
> > > this would be essentially same as "next_execution_date".
> > >
> > > Then no change to scheduler would be required, and no new dag parameter
> > or
> > > config.  Perhaps you could add a toggle to the DAGs UI view that lets
> you
> > > choose whether to display "last run" by "run_date" or "execution_date".
> > >
> > > If you want your dags to be parameterized by the date when they meant
> to
> > be
> > > run -- as opposed to their implicit interval-of-interest -- then you
> can
> > > reference "run_date".
> > >
> > > One potential source of confusion with this is backfilling: what does
> > > "run_date" mean in the context of a backfill?  You could say it means
> > > essentially "initial run date", i.e. "do not run before date", i.e.
> "run
> > > after date" or "run-at date".  So, for a daily, job the 2019-01-02
> > > "run_date" corresponds to a 2019-01-01 execution_date.  This makes
> sense
> > > right?
> > >
> > > Perhaps in the future, the relationship between "run_date" and
> > > "execution_date" can be more dynamic.  Perhaps in the future we rename
> > > "execution_date" for clarity, or to be more generic.  But it makes
> sense
> > > that a dag run will always have a run date, so it doesn't seem like a
> > > terrible idea to add a property representing this.
> > >
> > > Would this meet the goals of the PR?
> > >
> > >
> > >
> > >
> > > On Wed, Aug 28, 2019 at 11:50 AM James Meickle
> > > <jmeickle@quantopian.com.invalid> wrote:
> > >
> > >> Totally agree with Daniel here. I think that if we implement this
> > feature
> > >> as proposed, it will actively discourage us from implementing a better
> > >> data-aware feature that would remain invisible to most users while
> > neatly
> > >> addressing a lot of edge cases that currently require really ugly
> > hacks. I
> > >> believe that having more data awareness features in Airflow (like the
> > data
> > >> lineage work, or other metadata integrations) is worth investing in if
> > we
> > >> can do it without too much required user-facing complexity. The
> Airflow
> > >> project isn't a full data warehouse suite but it's also not just "cron
> > with
> > >> a UI", so we should try to be pragmatic and fit in power-user features
> > >> where we can do so without compromising the project's overall goals.
> > >>
> > >> On Wed, Aug 28, 2019 at 2:24 PM Daniel Standish <dpstandish@gmail.com
> >
> > >> wrote:
> > >>
> > >>> I am just thinking there is the potential for a more comprehensive
> > >>> enhancement here, and I worry that this is a band-aid that, like all
> > new
> > >>> features has the potential to constrain future options.  It does not
> > help
> > >>> us to do anything we cannot already do.
> > >>>
> > >>> The source of this problem is that scheduling and
> interval-of-interest
> > >> are
> > >>> mixed together.
> > >>>
> > >>> My thought is there may be a way to separate scheduling and
> > >>> interval-of-interest to uniformly resolve "execution_date" vs
> > "run_date"
> > >>> confusion.  We could make *explicit* instead of *implicit* the
> > >> relationship
> > >>> between run_date *(not currently a concept in airflow)* and
> > >>> "interval-of-interest" *(currently represented by execution_date)*.
> > >>>
> > >>> I also see in this the potential to unlock some other improvements:
> > >>> * support a greater diversity of incremental processes
> > >>> * allow more flexible backfilling
> > >>> * provide better views of data you have vs data you don't.
> > >>>
> > >>> The canonical airflow job is date-partitioned idempotent data pull.
> > Your
> > >>> interval of interest is from execution_date to execution_date + 1
> > >>> interval.  Schedule_interval is not just the scheduling cadence but
> it
> > is
> > >>> also your interval-of-interest partition function.   If that doesn't
> > work
> > >>> for your job, you set catchup=False and roll your own.
> > >>>
> > >>> What if there was a way to generalize?  E.g. could we allow for more
> > >>> flexible partition function that deviated from scheduler cadence?
> E.g.
> > >>> what if your interval-of-interest partitions could be governed by
> "min
> > 1
> > >>> day, max 30 days".  Then on on-going basis, your daily loads would
> be a
> > >>> range of 1 day but then if server down for couple days, this could
be
> > >>> caught up in one task and if you backfill it could be up to 30-day
> > >> batches.
> > >>>
> > >>> Perhaps there is an abstraction that could be used by a greater
> > diversity
> > >>> of incremental processes.  Such a thing could support a nice "data
> > >>> contiguity view". I imagine a horizontal bar that is solid where we
> > have
> > >>> the data and empty where we don't.  Then you click on a "missing"
> > section
> > >>> and you can  trigger a backfill task with that date interval
> according
> > to
> > >>> your partitioning rules.
> > >>>
> > >>> I can imagine using this for an incremental job where each time we
> pull
> > >> the
> > >>> new data since last time; in the `execute` method the operator could
> > set
> > >>> `self.high_watermark` with the max datetime processed.  Or maybe a
> > >> callback
> > >>> function could be used to gather this value.  This value could be
> used
> > in
> > >>> next run, and cold be depicted in a view.
> > >>>
> > >>> Default intervals of interest could be status quo -- i.e. partitions
> > >> equal
> > >>> to schedule interval -- but could be overwritten using templating or
> > >>> callbacks or setting it during `execute`.
> > >>>
> > >>> So anyway, I don't have a master plan all figured out.  But I think
> > there
> > >>> is opportunity in this area for more comprehensive enhancement that
> > goes
> > >>> more directly at the root of the problem.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Aug 27, 2019 at 10:00 AM Maxime Beauchemin <
> > >>> maximebeauchemin@gmail.com> wrote:
> > >>>
> > >>>> How about an alternative approach that would introduce 2 new keyword
> > >>>> arguments that are clear (something like, but maybe better than
> > >>>> `period_start_dttm`, `period_end_dttm`) and leave `execution_date`
> > >>>> unchanged, but plan it's deprecation. As a first step
> `execution_date`
> > >>>> would be inferred from the new args, and warn about deprecation
when
> > >>> used.
> > >>>>
> > >>>> Max
> > >>>>
> > >>>> On Tue, Aug 27, 2019 at 9:26 AM Bolke de Bruin <bdbruin@gmail.com>
> > >>> wrote:
> > >>>>
> > >>>>> Execution date is execution date for a dag run no matter what.
> There
> > >> is
> > >>>> no
> > >>>>> end interval or start interval for a dag run. The only time
this is
> > >>>>> relevant is when we calculate the next or previous dagrun.
> > >>>>>
> > >>>>> So I don't Daniels rationale makes sense (?)
> > >>>>>
> > >>>>> Sent from my iPhone
> > >>>>>
> > >>>>>> On 27 Aug 2019, at 17:40, Philippe Gagnon <philgagnon1@gmail.com>
> > >>>> wrote:
> > >>>>>>
> > >>>>>> I agree with Daniel's rationale but I am also worried about
> > >> backwards
> > >>>>>> compatibility as this would perhaps be the most disruptive
> breaking
> > >>>>> change
> > >>>>>> possible. I think maybe we should write down the different
options
> > >>>>>> available to us (AIP?) and call for a vote. What does everyone
> > >> think?
> > >>>>>>
> > >>>>>>> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jcoder01@gmail.com>
> > >>>> wrote:
> > >>>>>>>
> > >>>>>>> Can't execution date can already mean different things
depending
> > >> on
> > >>> if
> > >>>>> the
> > >>>>>>> dag run was initiated via the scheduler or manually
via command
> > >>>>> line/API?
> > >>>>>>> I agree that making it consistent might make it easier
to explain
> > >> to
> > >>>> new
> > >>>>>>> users, but should we exchange that for breaking pretty
much every
> > >>>>> existing
> > >>>>>>> dag by re-defining what execution date is?
> > >>>>>>> -James
> > >>>>>>>
> > >>>>>>> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <
> > >>>> dpstandish@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> To Daniel’s concerns, I would argue this
is not a change to
> > >> what a
> > >>>> dag
> > >>>>>>>> run
> > >>>>>>>>> is, it is rather a change to WHEN that dag
run will be
> > >> scheduled.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Execution date is part of the definition of a dag_run;
it is
> > >>> uniquely
> > >>>>>>>> identified by an execution_date and dag_id.
> > >>>>>>>>
> > >>>>>>>> When someone asks what is a dag_run, we should
be able to
> provide
> > >>> an
> > >>>>>>>> answer.
> > >>>>>>>>
> > >>>>>>>> Imagine trying to explain what a dag run is, when
execution_date
> > >>> can
> > >>>>> mean
> > >>>>>>>> different things.
> > >>>>>>>>   Admin: "A dag run is an execution_date and a
dag_id".
> > >>>>>>>>   New user: "Ok. Clear as a bell. What's an execution_date?"
> > >>>>>>>>   Admin: "Well, it can be one of two things.  It
*could* be when
> > >>> the
> > >>>>>>> dag
> > >>>>>>>> will be run... but it could *also* be 'the time
when dag should
> > >> be
> > >>>> run
> > >>>>>>>> minus one schedule interval".  It depends on whether
you choose
> > >>> 'end'
> > >>>>> or
> > >>>>>>>> 'start' for 'schedule_interval_edge.'  If you choose
'start'
> then
> > >>>>>>>> execution_date means 'when dag will be run'.  If
you choose
> 'end'
> > >>>> then
> > >>>>>>>> execution_date means 'when dag will be run minus
one interval.'
> > >> If
> > >>>> you
> > >>>>>>>> change the parameter after some time, then we don't
necessarily
> > >>> know
> > >>>>> what
> > >>>>>>>> it means at all times".
> > >>>>>>>>
> > >>>>>>>> Why would we do this to ourselves?
> > >>>>>>>>
> > >>>>>>>> Alternatively, we can give dag_run a clear, unambiguous
meaning:
> > >>>>>>>> * dag_run is dag_id + execution_date
> > >>>>>>>> * execution_date is when dag will be run (notwithstanding
> > >> scheduler
> > >>>>>>> delay,
> > >>>>>>>> queuing)
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Execution_date is defined as "run-at date minus
1 interval".
> The
> > >>>>>>>> assumption in this is that you tasks care about
this particular
> > >>> date.
> > >>>>>>>> Obviously this makes sense for some tasks but not
for others.
> > >>>>>>>>
> > >>>>>>>> I would prop
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder
<
> jcoder01@gmail.com
> > >>>
> > >>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>> I think this is a great improvement and should
be merged. To
> > >>>> Daniel’s
> > >>>>>>>>> concerns, I would argue this is not a change
to what a dag run
> > >> is,
> > >>>> it
> > >>>>>>> is
> > >>>>>>>>> rather a change to WHEN that dag run will be
scheduled.
> > >>>>>>>>> I had implemented a similar change in my own
version but
> > >>> ultimately
> > >>>>>>>> backed
> > >>>>>>>>> so I didn’t have to patch after each new
release. In my opinion
> > >>> the
> > >>>>>>> main
> > >>>>>>>>> flaw in the current scheduler, and I have brought
this up
> > >> before,
> > >>> is
> > >>>>>>> when
> > >>>>>>>>> you don’t have a consistent schedule interval
(e.g. only run
> > >> M-F).
> > >>>>>>> After
> > >>>>>>>>> backing out the “schedule at interval start”
I had to switch to
> > >> a
> > >>>>> daily
> > >>>>>>>>> schedule and go through and put a short circuit
operator in
> each
> > >>> of
> > >>>> my
> > >>>>>>>> M-F
> > >>>>>>>>> dags to get the behavior that I wanted. This
results in putting
> > >>>>>>>> scheduling
> > >>>>>>>>> logic inside the dag, when scheduling logic
should be in the
> > >>>>> scheduler.
> > >>>>>>>>>
> > >>>>>>>>> -James
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish
<
> > >>> dpstandish@gmail.com
> > >>>>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Re
> > >>>>>>>>>>
> > >>>>>>>>>>> What are people's feelings on changing
the default execution
> > >> to
> > >>>>>>>> schedule
> > >>>>>>>>>>> interval start
> > >>>>>>>>>>
> > >>>>>>>>>> and
> > >>>>>>>>>>
> > >>>>>>>>>>> I'm in favor of doing that, but then
exposing new variables
> of
> > >>>>>>>>>>> "interval_start" and "interval_end",
etc. so that people
> write
> > >>>>>>>>>>> clearer-looking at-a-glance DAGs
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> While I am def on board with the spirit
of this PR, I would
> > >> vote
> > >>> we
> > >>>>>>> do
> > >>>>>>>>> not
> > >>>>>>>>>> accept this PR as is, because it cements
a confusing option.
> > >>>>>>>>>>
> > >>>>>>>>>> *What is the right representation of a
dag run?*
> > >>>>>>>>>>
> > >>>>>>>>>> Right now the representation is "dag run-at
date minus 1
> > >>> interval".
> > >>>>>>> It
> > >>>>>>>>>> should just be "dag run-at date".
> > >>>>>>>>>>
> > >>>>>>>>>> We don't need to address the question of
whether execution
> date
> > >>> is
> > >>>>>>> the
> > >>>>>>>>>> start or the end of an interval; it doesn't
matter.
> > >>>>>>>>>>
> > >>>>>>>>>> In all cases, a given dag run will be targeted
for *some*
> > >> initial
> > >>>>>>>> "run-at
> > >>>>>>>>>> time"; so *that* should be the time that
is part of the PK of
> a
> > >>> dag
> > >>>>>>>> run,
> > >>>>>>>>>> and *that *is the time that should be exposed
as the dag run
> > >>>>>>> "execution
> > >>>>>>>>>> date"
> > >>>>>>>>>>
> > >>>>>>>>>> *Interval of interest is not a dag_run
attribute*
> > >>>>>>>>>>
> > >>>>>>>>>> We also mix in this question of the date
interval that the
> > >>> *tasks*
> > >>>>>>> are
> > >>>>>>>>>> interested in.  But the *dag run* need
not concern itself with
> > >>> this
> > >>>>>>> in
> > >>>>>>>>> any
> > >>>>>>>>>> way.  That is for the tasks to figure out:
if they happen to
> > >> need
> > >>>>>>> "dag
> > >>>>>>>>>> run-at date," then they can reference that;
if they want the
> > >>> prior
> > >>>>>>> one,
> > >>>>>>>>> ask
> > >>>>>>>>>> for the prior one.
> > >>>>>>>>>>
> > >>>>>>>>>> Previously, I was in the camp that thought
it was a great idea
> > >> to
> > >>>>>>>> rename
> > >>>>>>>>>> "execution_date" to "period_start" or "interval_start".
 But I
> > >>> now
> > >>>>>>>> think
> > >>>>>>>>>> this is folly.  It invokes this question
of the "interval of
> > >>>>>>> interest"
> > >>>>>>>> or
> > >>>>>>>>>> "period of interest".  But the dag doesn't
need to know
> > >> anything
> > >>>>>>> about
> > >>>>>>>>>> that.
> > >>>>>>>>>>
> > >>>>>>>>>> Within the same dag you may have tasks
with different
> intervals
> > >>> of
> > >>>>>>>>>> interest.  So why make assumptions in the
dag; just give the
> > >>> facts:
> > >>>>>>>> this
> > >>>>>>>>> is
> > >>>>>>>>>> my run date; this is the prior run date,
etc.  It would be a
> > >>>>>>> regression
> > >>>>>>>>>> from the perspective of providing accurate
names.
> > >>>>>>>>>>
> > >>>>>>>>>> *Proposal*
> > >>>>>>>>>>
> > >>>>>>>>>> So, I would propose we change "execution_date"
to mean "dag
> > >>> run-at
> > >>>>>>>> date"
> > >>>>>>>>> as
> > >>>>>>>>>> opposed to "dag run-at date minus 1". 
But we should do so
> > >>> without
> > >>>>>>>>>> reference to interval end or interval start.
> > >>>>>>>>>>
> > >>>>>>>>>> *Configurability*
> > >>>>>>>>>>
> > >>>>>>>>>> The more configuration options we have,
the more noise there
> is
> > >>> as
> > >>>> a
> > >>>>>>>> user
> > >>>>>>>>>> trying to understand how to use airflow,
so I'd rather us not
> > >>> make
> > >>>>>>> this
> > >>>>>>>>>> configurable at all.
> > >>>>>>>>>>
> > >>>>>>>>>> That said, perhaps a more clear and more
explicit means making
> > >>> this
> > >>>>>>>>>> configurable would be to define an integer
param
> > >>>>>>>>>> "dag_run_execution_date_interval_offset",
which would control
> > >> how
> > >>>>>>> many
> > >>>>>>>>>> intervals back from actual "dag run-at
date" the "execution
> > >> date"
> > >>>>>>>> should
> > >>>>>>>>>> be.  (current behavior = 1, new behavior
= 0).
> > >>>>>>>>>>
> > >>>>>>>>>> *Side note*
> > >>>>>>>>>>
> > >>>>>>>>>> Hopefully not to derail discussion: I think
there are
> > >> additional,
> > >>>>>>>> related
> > >>>>>>>>>> task attributes that may want to come into
being: namely,
> > >>>>>>> low_watermark
> > >>>>>>>>> and
> > >>>>>>>>>> high_watermark.  There is the potential,
with attributes like
> > >>> this,
> > >>>>>>> for
> > >>>>>>>>>> adding better out-of-the-box support for
common data workflows
> > >>> that
> > >>>>>>> we
> > >>>>>>>>> now
> > >>>>>>>>>> need to use xcom for, namely incremental
loads.  But I want to
> > >>> give
> > >>>>>>> it
> > >>>>>>>>> more
> > >>>>>>>>>> thought before proposing anything specific.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk
<
> > >>>>>>> Jarek.Potiuk@polidea.com
> > >>>>>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Good one Damian. I will have a list
of issues that can be
> > >>> possible
> > >>>>>>> to
> > >>>>>>>>>>> handle at the workshop, so that one
goes there.
> > >>>>>>>>>>>
> > >>>>>>>>>>> J.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Principal Software Engineer
> > >>>>>>>>>>> Phone: +48660796129
> > >>>>>>>>>>>
> > >>>>>>>>>>> pt., 23 sie 2019, 11:09 użytkownik
Shaw, Damian P. <
> > >>>>>>>>>>> damian.shaw.2@credit-suisse.com>
napisał:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> I can't understate what a conceptual
improvement this would
> > >> be
> > >>>> for
> > >>>>>>>> the
> > >>>>>>>>>>> end
> > >>>>>>>>>>>> users of Airflow in our environment.
I've written a lot of
> > >> code
> > >>>> so
> > >>>>>>>> all
> > >>>>>>>>>>> our
> > >>>>>>>>>>>> configuration works like this anyway.
But the UI still shows
> > >>> the
> > >>>>>>>>> Airflow
> > >>>>>>>>>>>> dates which still to this day sometimes
confuse me.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'll be at the NY meet ups on Monday
and Tuesday, maybe some
> > >> of
> > >>>> my
> > >>>>>>>>> first
> > >>>>>>>>>>>> PRs could be additional test cases
around edge cases to do
> > >> with
> > >>>> DST
> > >>>>>>>> and
> > >>>>>>>>>>>> cron scheduling that I have concerns
about :)
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Damian
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>> From: Ash Berlin-Taylor [mailto:ash@apache.org]
> > >>>>>>>>>>>> Sent: Friday, August 23, 2019 6:50
AM
> > >>>>>>>>>>>> To: dev@airflow.apache.org
> > >>>>>>>>>>>> Subject: Setting to add choice
of schedule at end or
> schedule
> > >>> at
> > >>>>>>>> start
> > >>>>>>>>> of
> > >>>>>>>>>>>> interval
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This has come up a few times before,
someone has now opened
> a
> > >>> PR
> > >>>>>>> that
> > >>>>>>>>>>>> makes this a global+per-dag setting:
> > >>>>>>>>>>>> https://github.com/apache/airflow/pull/5787
and it also
> > >>> includes
> > >>>>>>>> docs
> > >>>>>>>>>>>> that I think does a good job of
illustrating the two modes.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Does anyone object to this being
merged? If no one says
> > >>> anything
> > >>>> by
> > >>>>>>>>>>> midday
> > >>>>>>>>>>>> on Tuesday I will take that as
assent and will merge it.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The docs from the PR included below.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>> Ash
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Scheduled Time vs Execution Time
> > >>>>>>>>>>>> ''''''''''''''''''''''''''''''''
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> A DAG with a ``schedule_interval``
will execute once per
> > >>>> interval.
> > >>>>>>> By
> > >>>>>>>>>>>> default, the execution of a DAG
will occur at the **end** of
> > >>> the
> > >>>>>>>>>>>> schedule interval.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> A few examples:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - A DAG with ``schedule_interval='@hourly'``:
The DAG run
> > >> that
> > >>>>>>>>> processes
> > >>>>>>>>>>>> 2019-08-16 17:00 will start running
just after 2019-08-16
> > >>>> 17:59:59,
> > >>>>>>>>>>>> i.e. once that hour is over.
> > >>>>>>>>>>>> - A DAG with ``schedule_interval='@daily'``:
The DAG run
> that
> > >>>>>>>> processes
> > >>>>>>>>>>>> 2019-08-16 will start running shortly
after 2019-08-17
> 00:00.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The reasoning behind this execution
vs scheduling behaviour
> > >> is
> > >>>> that
> > >>>>>>>>>>>> data for the interval to be processed
won't be fully
> > >> available
> > >>>>>>> until
> > >>>>>>>>>>>> the interval has elapsed.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> In cases where you wish the DAG
to be executed at the
> > >> **start**
> > >>>> of
> > >>>>>>>> the
> > >>>>>>>>>>>> interval, specify ``schedule_at_interval_end=False``,
either
> > >> in
> > >>>>>>>>>>>> ``airflow.cfg``, or on a per-DAG
basis.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> ===============================================================================
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Please access the attached hyperlink
for an important
> > >>> electronic
> > >>>>>>>>>>>> communications disclaimer:
> > >>>>>>>>>>>>
> > >> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> ===============================================================================
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message