airflow-dev mailing list archives

From Jarek Potiuk <Jarek.Pot...@polidea.com>
Subject Re: Setting to add choice of schedule at end or schedule at start of interval
Date Fri, 23 Aug 2019 13:10:39 GMT
DST: I recall problems with DST, especially when the hour goes back and the
daily schedule time technically occurs twice on the same day, or when it
does not occur at all. We have some code that arbitrarily chooses the first
occurrence when the time occurs twice (there was a problem that it worked
differently in Python 3.6 vs 3.5 (!)). The case when we move forward is
also an interesting one. I am not 100% sure it will work correctly after
changing the scheduling mechanisms, but it's rather easy to test and there
is no harm in adding it.
There is DST-specific logic implemented in our next/previous run
calculation and I imagine it could go wrong.

The tests I am talking about:
DagTest.test_following_previous_schedule_daily_dag_CEST_to_CET/DagTest.test_following_previous_schedule_daily_dag_CET_to_CEST.
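
To make the two DST edge cases concrete, here is a minimal sketch using only
the Python standard library (zoneinfo, Python 3.9+). It is not the code from
Airflow's next/previous run calculation; the helper name classify_wall_time
and the choice of Europe/Warsaw are assumptions made for illustration. It
relies on the datetime "fold" attribute, which exists from Python 3.6 on.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumed example zone; any zone with CET/CEST transitions behaves the same.
WARSAW = ZoneInfo("Europe/Warsaw")


def classify_wall_time(naive, tz):
    """Classify a naive wall-clock time as 'normal', 'ambiguous' or 'nonexistent'.

    Hypothetical helper for illustration only, not part of Airflow.
    """
    first = naive.replace(tzinfo=tz, fold=0)   # first (or only) interpretation
    second = naive.replace(tzinfo=tz, fold=1)  # second interpretation, if any
    if first.utcoffset() == second.utcoffset():
        return "normal"
    # A wall-clock time that really exists survives a round trip through UTC;
    # a time inside the spring-forward gap comes back shifted.
    round_trip = first.astimezone(timezone.utc).astimezone(tz)
    if round_trip.replace(tzinfo=None) == naive:
        return "ambiguous"
    return "nonexistent"


# CEST -> CET (clocks go back): 02:30 happens twice on 2019-10-27.
autumn = classify_wall_time(datetime(2019, 10, 27, 2, 30), WARSAW)
# CET -> CEST (clocks go forward): 02:30 does not happen on 2019-03-31.
spring = classify_wall_time(datetime(2019, 3, 31, 2, 30), WARSAW)
# An ordinary day for comparison.
summer = classify_wall_time(datetime(2019, 8, 23, 2, 30), WARSAW)
```

A daily DAG scheduled at 02:30 local time would hit both cases once a year,
which is why extending the two tests above to both scheduling modes seems
worthwhile.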

Re: arbitrary customisation/converting DAGs. I think there is no need to
convert existing DAGs - the default behaviour remains as it is, as far as I
understand. And this flag is much simpler to understand and reason about
than an arbitrary function, and it corresponds to real business cases:

1) schedule_at_interval_end = True -> wait for the data to be ready for the
interval (current/default behaviour related to processing batches of data)
2) schedule_at_interval_end = False -> CRON-like behaviour where we simply
run an arbitrary operation at regular intervals (more intuitive for people
who are used to CRON-like jobs)

You can always build your schedule differently if you need something
"in-between" IMHO.
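
The two modes above can be sketched as follows. This is a simplified
illustration under my own assumptions, not the implementation from the PR:
the function name trigger_time and its parameters are hypothetical, and it
ignores time zones and the scheduler's own latency.

```python
from datetime import datetime, timedelta


def trigger_time(interval_start, interval_length, schedule_at_interval_end=True):
    """Return the earliest moment a run for the given interval may start.

    Hypothetical helper for illustration only, not Airflow's scheduler code.
    """
    if schedule_at_interval_end:
        # Default behaviour: wait until the interval is over, so that the
        # data for the interval is complete.
        return interval_start + interval_length
    # CRON-like behaviour: run as soon as the interval begins.
    return interval_start


hourly_start = datetime(2019, 8, 16, 17, 0)
hour = timedelta(hours=1)

at_end = trigger_time(hourly_start, hour)  # 2019-08-16 18:00
at_start = trigger_time(hourly_start, hour, schedule_at_interval_end=False)  # 2019-08-16 17:00
```

With an hourly interval starting at 2019-08-16 17:00, the first call gives
18:00 (once that hour is over) and the second gives 17:00.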

J.




On Fri, Aug 23, 2019 at 8:44 AM James Meickle
<jmeickle@quantopian.com.invalid> wrote:

> This is a change to one of Airflow's core concepts, and it would require a
> lot of work for existing DAGs to cut over to it. Given that, my personal
> preference would be to allow arbitrary customization rather than just a bit
> toggle, such as allowing a mapping function to be passed in: given an
> interval's start date and end date, when should it be executed?
>
> On Fri, Aug 23, 2019 at 8:24 AM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> wrote:
>
> > Happy for it as well. There are a number of cases where scheduling at
> > start makes more sense and, as we see, Airflow is now used in multiple
> > cases where there is no need to process data from an interval and wait
> > until that data is ready.
> > But indeed some more tests would be great - especially for edge cases.
> > Changing mid-air is one, but I think there should be a test about Daylight
> > Saving Time changes.
> > There are some tests for DST, so they just need to be extended to cover
> > those two different cases.
> >
> >
> > J.
> >
> > On Fri, Aug 23, 2019 at 7:37 AM Kaxil Naik <kaxilnaik@gmail.com> wrote:
> >
> > > Happy for this feature to be merged
> > >
> > > On Fri, Aug 23, 2019, 11:49 Ash Berlin-Taylor <ash@apache.org> wrote:
> > >
> > > > This has come up a few times before. Someone has now opened a PR that
> > > > makes this a global+per-dag setting:
> > > > https://github.com/apache/airflow/pull/5787 and it also includes docs
> > > > that I think do a good job of illustrating the two modes.
> > > >
> > > > Does anyone object to this being merged? If no one says anything by
> > > > midday on Tuesday I will take that as assent and will merge it.
> > > >
> > > > The docs from the PR included below.
> > > >
> > > > Thanks,
> > > > Ash
> > > >
> > > > Scheduled Time vs Execution Time
> > > > ''''''''''''''''''''''''''''''''
> > > >
> > > > A DAG with a ``schedule_interval`` will execute once per interval. By
> > > > default, the execution of a DAG will occur at the **end** of the
> > > > schedule interval.
> > > >
> > > > A few examples:
> > > >
> > > > - A DAG with ``schedule_interval='@hourly'``: The DAG run that
> > > > processes 2019-08-16 17:00 will start running just after
> > > > 2019-08-16 17:59:59, i.e. once that hour is over.
> > > > - A DAG with ``schedule_interval='@daily'``: The DAG run that
> > > > processes 2019-08-16 will start running shortly after 2019-08-17 00:00.
> > > >
> > > > The reasoning behind this execution vs scheduling behaviour is that
> > > > data for the interval to be processed won't be fully available until
> > > > the interval has elapsed.
> > > >
> > > > In cases where you wish the DAG to be executed at the **start** of the
> > > > interval, specify ``schedule_at_interval_end=False``, either in
> > > > ``airflow.cfg``, or on a per-DAG basis.
> > >
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> >
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
