airflow-dev mailing list archives

From James Coder <jcode...@gmail.com>
Subject Re: Setting to add choice of schedule at end or schedule at start of interval
Date Sat, 24 Aug 2019 12:08:39 GMT
I think this is a great improvement and should be merged. To Daniel’s concerns, I would argue
this is not a change to what a dag run is, it is rather a change to WHEN that dag run will
be scheduled. 
I had implemented a similar change in my own version but ultimately backed it out so I didn’t have
to re-apply the patch after each new release. In my opinion the main flaw in the current scheduler, and
I have brought this up before, is how it handles schedules without a consistent interval (e.g.
only run M-F). After backing out the “schedule at interval start” change I had to switch to a
daily schedule and add a short circuit operator to each of my M-F dags to get
the behavior that I wanted. This puts scheduling logic inside the dag, when
scheduling logic should be in the scheduler. 
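
A minimal sketch of the kind of weekday check such a short circuit task might use. The function name `should_run` is hypothetical, and which date gets passed in (the run day or the execution date) is an assumption left to the DAG author; this is not the exact code from my dags.

```python
from datetime import date


def should_run(run_day: date) -> bool:
    """Return True for Monday through Friday, False on weekends."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    return run_day.weekday() < 5


# In a daily DAG, a callable like this would be handed to a short circuit
# task (for example as the python_callable of Airflow's ShortCircuitOperator),
# so that downstream tasks are skipped whenever it returns False.
```

Every M-F dag needs its own copy of this gate, which is the duplication of scheduling logic I am objecting to.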

-James


> On Aug 23, 2019, at 3:14 PM, Daniel Standish <dpstandish@gmail.com> wrote:
> 
> Re
> 
>> What are people's feelings on changing the default execution to schedule
>> interval start
> 
> and
> 
>> I'm in favor of doing that, but then exposing new variables of
>> "interval_start" and "interval_end", etc. so that people write
>> clearer-looking at-a-glance DAGs
> 
> 
> While I am def on board with the spirit of this PR, I would vote we do not
> accept this PR as is, because it cements a confusing option.
> 
> *What is the right representation of a dag run?*
> 
> Right now the representation is "dag run-at date minus 1 interval".  It
> should just be "dag run-at date".
> 
> We don't need to address the question of whether execution date is the
> start or the end of an interval; it doesn't matter.
> 
> In all cases, a given dag run will be targeted for *some* initial "run-at
> time"; so *that* should be the time that is part of the PK of a dag run,
> and *that* is the time that should be exposed as the dag run "execution
> date"
> 
> *Interval of interest is not a dag_run attribute*
> 
> We also mix in this question of the date interval that the *tasks* are
> interested in.  But the *dag run* need not concern itself with this in any
> way.  That is for the tasks to figure out: if they happen to need "dag
> run-at date," then they can reference that; if they want the prior one, ask
> for the prior one.
> 
> Previously, I was in the camp that thought it was a great idea to rename
> "execution_date" to "period_start" or "interval_start".  But I now think
> this is folly.  It invokes this question of the "interval of interest" or
> "period of interest".  But the dag doesn't need to know anything about
> that.
> 
> Within the same dag you may have tasks with different intervals of
> interest.  So why make assumptions in the dag; just give the facts: this is
> my run date; this is the prior run date, etc.  It would be a regression
> from the perspective of providing accurate names.
> 
> *Proposal*
> 
> So, I would propose we change "execution_date" to mean "dag run-at date" as
> opposed to "dag run-at date minus 1".  But we should do so without
> reference to interval end or interval start.
> 
> *Configurability*
> 
> The more configuration options we have, the more noise there is as a user
> trying to understand how to use airflow, so I'd rather us not make this
> configurable at all.
> 
> That said, perhaps a clearer and more explicit means of making this
> configurable would be to define an integer param
> "dag_run_execution_date_interval_offset", which would control how many
> intervals back from actual "dag run-at date" the "execution date" should
> be.  (current behavior = 1, new behavior = 0).
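
A rough sketch of how the proposed offset could be read, assuming a fixed `timedelta` interval (cron schedules are not handled here). The function `execution_date_for` and its parameters are hypothetical illustrations, not part of Airflow or of the PR.

```python
from datetime import datetime, timedelta


def execution_date_for(run_at: datetime, interval: timedelta, offset: int) -> datetime:
    """Shift the actual run-at time back by `offset` whole intervals."""
    if offset < 0:
        raise ValueError("offset must be zero or a positive number of intervals")
    # offset=1 corresponds to the current behavior described above,
    # offset=0 to the proposed "execution date equals run-at date".
    return run_at - offset * interval


run_at = datetime(2019, 8, 17, 0, 0)
one_day = timedelta(days=1)
current = execution_date_for(run_at, one_day, offset=1)   # 2019-08-16 00:00
proposed = execution_date_for(run_at, one_day, offset=0)  # 2019-08-17 00:00
```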
> 
> *Side note*
> 
> Hopefully not to derail discussion: I think there are additional, related
> task attributes that may want to come into being: namely, low_watermark and
> high_watermark.  There is the potential, with attributes like this, for
> adding better out-of-the-box support for common data workflows that we now
> need to use xcom for, namely incremental loads.  But I want to give it more
> thought before proposing anything specific.
> 
> 
> 
> 
> 
> 
> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> wrote:
> 
>> Good one Damian. I will have a list of issues that can be possible to
>> handle at the workshop, so that one goes there.
>> 
>> J.
>> 
>> Principal Software Engineer
>> Phone: +48660796129
>> 
>> On Fri, Aug 23, 2019, 11:09 Shaw, Damian P. <
>> damian.shaw.2@credit-suisse.com> wrote:
>> 
>>> I can't overstate what a conceptual improvement this would be for the end
>>> users of Airflow in our environment. I've written a lot of code so all our
>>> configuration works like this anyway. But the UI still shows the Airflow
>>> dates, which to this day sometimes confuse me.
>>> 
>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some of my first
>>> PRs could be additional test cases around edge cases to do with DST and
>>> cron scheduling that I have concerns about :)
>>> 
>>> Damian
>>> 
>>> -----Original Message-----
>>> From: Ash Berlin-Taylor [mailto:ash@apache.org]
>>> Sent: Friday, August 23, 2019 6:50 AM
>>> To: dev@airflow.apache.org
>>> Subject: Setting to add choice of schedule at end or schedule at start of
>>> interval
>>> 
>>> This has come up a few times before, someone has now opened a PR that
>>> makes this a global+per-dag setting:
>>> https://github.com/apache/airflow/pull/5787 and it also includes docs
>>> that I think do a good job of illustrating the two modes.
>>> 
>>> Does anyone object to this being merged? If no one says anything by midday
>>> on Tuesday I will take that as assent and will merge it.
>>> 
>>> The docs from the PR included below.
>>> 
>>> Thanks,
>>> Ash
>>> 
>>> Scheduled Time vs Execution Time
>>> ''''''''''''''''''''''''''''''''
>>> 
>>> A DAG with a ``schedule_interval`` will execute once per interval. By
>>> default, the execution of a DAG will occur at the **end** of the
>>> schedule interval.
>>> 
>>> A few examples:
>>> 
>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that processes
>>> 2019-08-16 17:00 will start running just after 2019-08-16 17:59:59,
>>> i.e. once that hour is over.
>>> - A DAG with ``schedule_interval='@daily'``: The DAG run that processes
>>> 2019-08-16 will start running shortly after 2019-08-17 00:00.
>>> 
>>> The reasoning behind this execution vs scheduling behaviour is that
>>> data for the interval to be processed won't be fully available until
>>> the interval has elapsed.
>>> 
>>> In cases where you wish the DAG to be executed at the **start** of the
>>> interval, specify ``schedule_at_interval_end=False``, either in
>>> ``airflow.cfg``, or on a per-DAG basis.
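
The two modes in the docs above can be sketched as follows, assuming a fixed `timedelta` interval. Only the name `schedule_at_interval_end` comes from the PR docs; the helper `run_start` is a hypothetical illustration and not the scheduler's actual code.

```python
from datetime import datetime, timedelta


def run_start(execution_date: datetime, interval: timedelta,
              schedule_at_interval_end: bool = True) -> datetime:
    """Earliest time at which the run for `execution_date` may start."""
    if schedule_at_interval_end:
        # Default mode: wait until the interval being processed has elapsed.
        return execution_date + interval
    # Alternative mode: start as soon as the interval begins.
    return execution_date


hourly = run_start(datetime(2019, 8, 16, 17, 0), timedelta(hours=1))
daily = run_start(datetime(2019, 8, 16), timedelta(days=1))
at_start = run_start(datetime(2019, 8, 16), timedelta(days=1),
                     schedule_at_interval_end=False)
```

Here `hourly` is 2019-08-16 18:00 and `daily` is 2019-08-17 00:00, matching the two examples in the docs, while `at_start` stays at 2019-08-16 00:00.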
>>> 
